Ultrahigh dimensional variable selection plays an increasingly important
role in contemporary scientific discoveries and statistical research. Among
others, Fan and Lv (2008) propose an independent screening framework by
ranking the marginal correlations. They showed that the correlation ranking
procedure possesses a sure independence screening property within the
context of the linear model with Gaussian covariates and responses. In this
paper, we propose a more general version of the independent learning with
ranking the maximum marginal likelihood estimates or the maximum marginal
likelihood itself in generalized linear models. We show that the proposed
methods, with Fan and Lv (2008) as a very special case, also possess the
sure screening property with vanishing false selection rate. The conditions
under which the independence learning possesses the sure screening property
are surprisingly simple. This justifies the applicability of such a simple
method in a wide spectrum. We quantify explicitly the extent to which the
dimensionality can be reduced by independence screening, which depends on
the interactions of the covariance matrix of covariates and true
parameters. Simulation studies are used to illustrate the utility of the
proposed approaches. In addition, we establish an exponential inequality
for the quasi-maximum likelihood estimator which is useful for
high-dimensional statistical learning.



1. Introduction. The ultrahigh dimensional regression problem is a
significant feature in many areas of modern scientific research using
quantitative measurements such as microarrays, genomics, proteomics, brain
images and genetic data. For example, studying the associations between
phenotypes such as height and cholesterol level and genotypes can involve
millions of SNPs; disease classification using microarray data can use
thousands of expression profiles; and dimensionality grows rapidly when
interactions are considered. Such a demand from applications brings many
challenges to statistical inference, as the dimension $p$ can grow much
faster than the sample size $n$, such that many models are not even
identifiable. By non-polynomial dimensionality, or simply NP-dimensionality,
we mean $\log p = O(n^a)$ for some $a > 0$. We will also loosely refer to it
as ultrahigh dimensionality. The phenomenon of noise accumulation in
high-dimensional regression has also been observed by statisticians and
computer scientists. See Fan and Lv (2008) and Fan and Fan (2008) for a
comprehensive review and references therein. When the dimension $p$ is
ultrahigh, it is often assumed that only a small number of variables among
the predictors $X_1, \ldots, X_p$ contribute to the response, which leads to
the sparsity of the parameter vector $\beta$. As a consequence, variable
selection plays a prominent role in high dimensional statistical modeling.

Many variable selection techniques for various high dimensional statistical
models have been proposed. Most of them are based on the penalized
pseudo-likelihood approach, to name a few, the bridge regression in
Frank and Friedman (1993), the LASSO in Tibshirani (1996), the SCAD and
other folded-concave penalty in Fan and Li (2001), the Dantzig selector in
Candes and Tao (2007) and their related methods (Zou, 2006; Zou and Li,
2008). Theoretical studies of these methods concentrate on the persistency
(Greenshtein and Ritov, 2004; van de Geer, 2008), consistency and oracle
properties (Fan and Li, 2001; Zou, 2006). However, in ultrahigh dimensional
statistical learning problems, these methods may not perform well due to
the simultaneous challenges of computational expediency, statistical
accuracy and algorithmic stability (Fan et al., 2008).



Fan and Lv (2008) proposed a sure independent screening (SIS) method to
select important variables in ultrahigh dimensional linear models. Their
proposed two-stage procedure can deal with the aforementioned three
challenges better than other methods. See also Huang et al. (2008) for a
related study based on a marginal bridge regression. Fan and Lv (2008)
showed that the correlation ranking of features possesses a sure
independence screening (SIS) property under certain conditions, that is,
with probability very close to 1, the independence screening technique
retains all of the important variables in the model. However, the SIS
procedure in Fan and Lv (2008) is restricted to the ordinary linear model,
and their technical arguments depend heavily on the joint normality
assumptions and can not easily be extended even within the context of a
linear model. This significantly limits its use in practice, since it
excludes categorical variables. Huang et al. (2008) also investigate the
marginal bridge regression in the ordinary linear model, and their
arguments also depend heavily on the explicit expressions of the
least-squares estimator and bridge regression. This calls for research on
SIS procedures in more general models and under less restrictive
assumptions.

In this paper, we consider an independence learning by ranking the maximum
marginal likelihood estimator (MMLE) or maximum marginal likelihood itself
for generalized linear models. That is, we fit $p$ marginal regressions by
maximizing the marginal likelihood with response $Y$ and the marginal
covariate $X_j$, $j = 1, \ldots, p$ (and the intercept) each time. The
magnitude of the absolute values of the MMLE can preserve the non-sparsity
information of the joint regression models, provided that the true values
of the marginal likelihood preserve the non-sparsity of the joint
regression models and that the MMLE estimates the true values of the
marginal likelihood uniformly well. The former holds under a surprisingly
simple condition, whereas the latter requires a development of uniform
convergence over NP-dimensional marginal likelihoods. Hall et al. (2009)
used a different marginal utility, derived from an empirical likelihood
point of view. Hall and Miller (2009) proposed a generalized correlation
ranking, which allows nonlinear regression. Both papers proposed an
interesting bootstrap method to assess the authority of the selected
features.

As the MMLE or maximum likelihood ranking is equivalent to the marginal
correlation ranking in the ordinary linear models, our work can thus be
considered as an important extension of SIS in Fan and Lv (2008), where the
joint normality of the response and covariates is imposed. Moreover, our
results improve over those in Fan and Lv (2008) in at least three aspects.
Firstly, we establish a new framework for having SIS properties, which does
not build on the normality assumption even in the linear model setting.
Secondly, while it is not obvious (and could be hard) to generalize the
proof of Fan and Lv (2008) to more complicated models, in the current
framework, the SIS procedure can be applied to the generalized linear
models and possibly other models. Thirdly, our results can easily be
applied to the generalized correlation ranking (Hall and Miller, 2009) and
other rankings based on a group of marginal variables.



imsart-aos ver. 2007/09/18 file: SISGLIM-revise-v3.tex date: January 18, 2010 



SIS IN GLIM 



Fitting marginal models to a joint regression is a type of model
misspecification (White, 1982), since we drop out most covariates from the
model fitting. In this paper, we establish a nonasymptotic tail probability
bound for the MMLE under model misspecifications, which is beyond the
traditional asymptotic framework of model misspecification and of interest
in its own right. As a practical screening method, independent screening
can miss variables that are marginally weakly correlated with the response
variables, but jointly highly important to the response variables, and also
rank some jointly unimportant variables too high by using marginal
methods. Fan and Lv (2008) and Fan et al. (2008) develop iteratively
conditional screening and selection methods to make the procedures robust
and practical. The former focuses on ordinary linear models and the latter
improves the idea in the former and expands significantly the scope of
applicability, including generalized linear models.

The SIS property can be achieved as long as the surrogate, in this case, 
the marginal utility, can preserve the non-sparsity of the true parameter 
values. With a similar idea, Fan et al. (2008) proposed a SIS procedure
for generalized linear models, by sorting the maximum likelihood functions, 
which is a type of "marginal likelihood ratio" ranking, whereas the MMLE 
can be viewed as a Wald type of statistic. The two methods are equivalent 
in terms of sure screening properties in our proposed framework. This will 
be demonstrated in our paper. The key technical challenge in the maximum 
marginal likelihood ranking is that the signal can even be weaker than the 
noise. We overcome this technical difficulty by using the invariance property 
of ranking under monotonic transforms. 

The rest of the paper is organized as follows. In Section 2, we briefly 
introduce the setups of the generalized linear models. The SIS procedure 
is presented in Section 3. In Section 4, we provide an exponential bound
for quasi maximum likelihood estimator. The SIS properties of the MMLE 
learning are presented in Section 5. In Section 6, we formulate the marginal 
likelihood screening and show the SIS property. Some simulation results are 
presented in Section 7. A summary of our findings and discussions is in 
Section 8. The detailed proofs are relegated to Section 9. 

2. Generalized Linear Models. Assume that the random scalar $Y$ is
from an exponential family with the probability density function taking the
canonical form

(1)  $f_Y(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\},$

for some known functions $b(\cdot)$, $c(\cdot)$ and unknown function
$\theta$. Here we do not consider the dispersion parameter as we only model
the mean regression. We can easily introduce a dispersion parameter in (1)
and the results continue to hold. The function $\theta$ is usually called
the canonical or natural parameter. The mean response is $b'(\theta)$, the
first derivative of $b(\theta)$ with respect to $\theta$. We consider the
problem of estimating a $(p+1)$-vector of parameters
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ from the following
generalized linear model

(2)  $E(Y \mid X = x) = b'(\theta(x)) = g^{-1}\Big(\sum_{j=0}^{p} \beta_j x_j\Big),$

where $x = (x_0, x_1, \ldots, x_p)^T$ is a $(p+1)$-dimensional covariate
and $x_0 = 1$ represents the intercept. If $g$ is the canonical link, i.e.,
$g = (b')^{-1}$, then $\theta(x) = \sum_{j=0}^{p} \beta_j x_j$. We focus on
the canonical link function in this paper for simplicity of presentation.
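
To make the canonical form (1) concrete, the following sketch lists the cumulant function $b(\theta)$ and the mean function $b'(\theta)$ for the three families discussed later in the paper (normal with unit variance, Bernoulli, Poisson). The dictionary `FAMILIES` and the helper `mean_response` are illustrative names introduced here, not part of the paper.

```python
import numpy as np

# Cumulant function b(theta) and its derivative b'(theta) for three
# canonical exponential families (names are illustrative assumptions).
FAMILIES = {
    "normal":    (lambda t: 0.5 * t**2,           lambda t: t),
    "bernoulli": (lambda t: np.logaddexp(0.0, t), lambda t: 1.0 / (1.0 + np.exp(-t))),
    "poisson":   (lambda t: np.exp(t),            lambda t: np.exp(t)),
}

def mean_response(family, theta):
    """Return b'(theta), the mean of Y under the canonical link."""
    b_prime = FAMILIES[family][1]
    return b_prime(np.asarray(theta, dtype=float))
```

Under the canonical link, `mean_response(family, x @ beta)` is the right-hand side of (2) for a covariate vector `x` whose first entry is 1.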

Assume that the observed data $\{(X_i, Y_i), i = 1, \ldots, n\}$ are i.i.d.
copies of $(X, Y)$, where the covariate $X = (X_0, X_1, \ldots, X_p)$ is a
$(p+1)$-dimensional random vector and $X_0 = 1$. We allow $p$ to grow with
$n$ and denote it as $p_n$ whenever needed.

We note that the ordinary linear model $Y = X^T\beta + \varepsilon$ with
$\varepsilon \sim N(0, 1)$ is a special case of model (2), by taking
$g(\mu) = \mu$ and $b(\theta) = \theta^2/2$. When the design matrix $X$ is
standardized, the ranking by the magnitude of the marginal correlation is
in fact the same as the ranking by the magnitude of the maximum marginal
likelihood estimator (MMLE). Next we propose an independence screening
method to GLIM based on the MMLE. We also assume that the covariates are
standardized to have mean zero and standard deviation one:

$E X_j = 0$, and $E X_j^2 = 1$, $j = 1, \ldots, p_n$.
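
The equivalence between correlation ranking and MMLE ranking in the linear model can be checked numerically. The sketch below uses simulated data (the sample sizes, coefficients, and variable names are arbitrary choices made for illustration): with columns standardized to sample mean zero and sample standard deviation one, the marginal least-squares slope of $Y$ on $X_j$ is the sample covariance, which differs from the sample correlation only by the common factor $\mathrm{sd}(Y)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

# Standardize each column to sample mean zero and sample standard deviation one.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Marginal least-squares slope with intercept: cov(x_j, y) / var(x_j),
# which reduces to the sample covariance because var(x_j) = 1.
slopes = Xs.T @ (y - y.mean()) / n

# Marginal sample correlations: the same numbers divided by sd(y).
corrs = slopes / y.std()

rank_by_slope = np.argsort(-np.abs(slopes))
rank_by_corr = np.argsort(-np.abs(corrs))
```

Because the two statistics differ by one positive constant, `rank_by_slope` and `rank_by_corr` order the variables identically.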

3. Independence Screening with MMLE. Let
$\mathcal{M}_* = \{1 \le j \le p_n : \beta_j^* \ne 0\}$ be the true sparse
model with non-sparsity size $s_n = |\mathcal{M}_*|$, where
$\beta^* = (\beta_0^*, \beta_1^*, \ldots, \beta_{p_n}^*)$ denotes the true
value. In this paper, we refer to marginal models as fitting models with
componentwise covariates. The maximum marginal likelihood estimator (MMLE)
$\hat{\boldsymbol\beta}_j^M$, for $j = 1, \ldots, p_n$, is defined as the
minimizer of the componentwise regression:

$\hat{\boldsymbol\beta}_j^M = (\hat\beta_{j,0}^M, \hat\beta_j^M) = \arg\min_{\beta_0, \beta_j} P_n\, l(\beta_0 + \beta_j X_j, Y),$

where $l(Y; \theta) = -[\theta Y - b(\theta) - \log c(Y)]$ and
$P_n f(X, Y) = n^{-1} \sum_{i=1}^{n} f(X_i, Y_i)$ is the empirical measure.
This can be rapidly computed and its implementation is robust, avoiding
numerical instability in NP-dimensional problems. We correspondingly define
the population version of the minimizer of the componentwise regression,

$\boldsymbol\beta_j^M = (\beta_{j,0}^M, \beta_j^M) = \arg\min_{\beta_0, \beta_j} E\, l(\beta_0 + \beta_j X_j, Y),$ for $j = 1, \ldots, p_n$,

where $E$ denotes the expectation under the true model.
We select a set of variables

(3)  $\widehat{\mathcal{M}}_{\gamma_n} = \{1 \le j \le p_n : |\hat\beta_j^M| \ge \gamma_n\},$

where $\gamma_n$ is a predefined threshold value. Such an independence
learning ranks the importance of features according to the magnitude of
their marginal regression coefficients. With an independence learning, we
dramatically decrease the dimension of the parameter space from $p_n$
(possibly hundreds of thousands) to a much smaller number by choosing a
large $\gamma_n$, hence the computation is much more feasible. Although the
interpretations and implications of the marginal models are biased from the
joint model, the non-sparse information about the joint model can be passed
along to the marginal model under a mild condition. Hence it is suitable
for the purpose of variable screening. Next we will show under certain
conditions that the sure screening property holds, i.e., the set
$\mathcal{M}_*$ belongs to $\widehat{\mathcal{M}}_{\gamma_n}$ with
probability one asymptotically, for an appropriate choice of $\gamma_n$. To
accomplish this, we need the following technical device.
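
The screening rule (3) can be sketched in a few lines for logistic regression. This is only an illustration under stated assumptions: the paper does not prescribe an optimizer, so the plain Newton-Raphson iteration, the tiny ridge term added for numerical safety, the function names `fit_marginal_logistic` and `mmle_screen`, and the simulated data and threshold are all choices made here.

```python
import numpy as np

def fit_marginal_logistic(x, y, n_iter=25):
    """Newton-Raphson for the two-parameter logistic fit of y on (1, x)."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))   # b'(theta)
        w = mu * (1.0 - mu)                       # b''(theta)
        grad = Z.T @ (mu - y)
        hess = (Z * w[:, None]).T @ Z + 1e-8 * np.eye(2)  # small ridge (assumption)
        beta = beta - np.linalg.solve(hess, grad)
    return beta

def mmle_screen(X, y, gamma):
    """Return indices j with |marginal slope| >= gamma, plus all slopes."""
    slopes = np.array([fit_marginal_logistic(X[:, j], y)[1]
                       for j in range(X.shape[1])])
    return np.flatnonzero(np.abs(slopes) >= gamma), slopes

# Simulated example: only the first two covariates enter the joint model.
rng = np.random.default_rng(1)
n, p = 400, 30
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized covariates
theta = 1.5 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-theta))).astype(float)

selected, slopes = mmle_screen(X, y, gamma=0.5)
```

Each marginal fit is a two-parameter problem, so the cost grows only linearly in the number of covariates, which is what makes the procedure practical when $p_n$ is very large.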

4. An exponential bound for QMLE. In this section, we obtain an 
exponential bound for the quasi-MLE (QMLE), which will be used in the 
next section. Since this result holds under very general conditions and is
of independent interest, in the following we give a more general
description of the model and its conditions.

Consider data $\{X_i, Y_i\}$, $i = 1, \ldots, n$, which are $n$ i.i.d.
samples of $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ for some spaces
$\mathcal{X}$ and $\mathcal{Y}$. A regression model for $X$ and $Y$ is
assumed with quasi-likelihood function $-l(X^T\beta, Y)$. Here $Y$ and
$X = (X_1, \ldots, X_q)^T$ represent the response and the $q$-dimensional
covariate vector, which may include both discrete and continuous components
and the dimensionality can also depend on $n$. Let

$\beta_0 = \arg\min_{\beta} E\, l(X^T\beta, Y)$

be the population parameter. Assume that $\beta_0$ is an interior point of
a sufficiently large, compact and convex set $B \subset R^q$. The following
conditions on the model are needed:

A. The Fisher information

$I(\beta) = E\Big\{ \Big[\frac{\partial}{\partial\beta} l(X^T\beta, Y)\Big] \Big[\frac{\partial}{\partial\beta} l(X^T\beta, Y)\Big]^T \Big\}$

is finite and positive definite at $\beta = \beta_0$. Moreover,
$\|I(\beta)\|_B = \sup_{\beta \in B, \|x\| = 1} \|I(\beta)^{1/2} x\|$
exists, where $\|\cdot\|$ is the Euclidean norm.

B. The function $l(x^T\beta, y)$ satisfies the Lipschitz property with
positive constant $k_n$:

$|l(x^T\beta, y) - l(x^T\beta', y)|\, I_n(x, y) \le k_n |x^T\beta - x^T\beta'|\, I_n(x, y),$

for $\beta, \beta' \in B$, where $I_n(x, y) = I((x, y) \in \Omega_n)$ with

$\Omega_n = \{(x, y) : \|x\|_\infty \le K_n, |y| \le K_n^*\},$

for some sufficiently large positive constants $K_n$ and $K_n^*$, and
$\|\cdot\|_\infty$ being the supremum norm. In addition, there exists a
sufficiently large constant $C$ such that with
$b_n = C k_n V_n^{-1} (q/n)^{1/2}$ and $V_n$ given in Condition C,

$\sup_{\beta \in B,\, \|\beta - \beta_0\| \le b_n} |E[l(X^T\beta, Y) - l(X^T\beta_0, Y)](1 - I_n(X, Y))| \le o(q/n).$

C. The function $l(X^T\beta, Y)$ is convex in $\beta$, satisfying

$E(l(X^T\beta, Y) - l(X^T\beta_0, Y)) \ge V_n \|\beta - \beta_0\|^2,$

for all $\|\beta - \beta_0\| \le b_n$ and some positive constants $V_n$.
Condition A is analogous to assumption A6(b) of White (1982) and assumption
$R_s$ in Fahrmeir and Kaufmann (1986). It ensures the identifiability and
the existence of the QMLE and is satisfied for many examples of generalized
linear models. Conditions A and C overlap but are not the same.

We now establish an exponential bound for the tail probability of the QMLE

$\hat\beta = \arg\min_{\beta} P_n\, l(X^T\beta, Y).$

The idea of the proof is to connect $\sqrt{n}\|\hat\beta - \beta_0\|$ to
the tail of certain empirical processes and utilize the convexity and
Lipschitz continuities.

Theorem 1. Under Conditions A-C, it holds that for any $t > 0$,

$P\big(\sqrt{n}\|\hat\beta - \beta_0\| \ge 16 k_n (1 + t)/V_n\big) \le \exp(-2t^2/K_n^2) + n P(\Omega_n^c).$

5. Sure Screening Properties with MMLE. In this section, we introduce
a new framework for establishing the sure screening property with MMLE in
the canonical exponential family (1). We present our findings in three
subsections.

5.1. Population Aspect. As fitting marginal regressions to a joint
regression is a type of model misspecification, an important question is at
what level the model information is preserved. Specifically, for screening
purposes, we are interested in the preservation of the non-sparsity from
the joint regression to the marginal regression. This can be summarized
into the following two questions. First, for the sure screening purpose, if
a variable $X_j$ is jointly important ($\beta_j^* \ne 0$), will it (and
under what conditions) still be marginally important ($\beta_j^M \ne 0$)?
Second, for the model selection consistency purpose, if a variable $X_j$ is
jointly unimportant ($\beta_j^* = 0$), will it still be marginally
unimportant ($\beta_j^M = 0$)? We aim to answer these two questions in this
section.

The following theorem reveals that the marginal regression parameter is 
in fact a measurement of the correlation between the marginal covariate and 
the mean response function. 

Theorem 2. For $j = 1, \ldots, p_n$, the marginal regression parameters
$\beta_j^M = 0$ if and only if $\mathrm{cov}(b'(X^T\beta^*), X_j) = 0$.
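
A short sketch of why this equivalence is plausible, using only the first-order conditions of the population marginal fit under the canonical link, the standardization $E X_j = 0$, and uniqueness of the marginal minimizer. It is offered as a reading aid under those assumptions, not as a replacement for the proof given in Section 9.

```latex
% Score equations of the population marginal fit (canonical link):
E\{ b'(\beta_{j,0}^M + \beta_j^M X_j) \} = E Y, \qquad
E\{ b'(\beta_{j,0}^M + \beta_j^M X_j)\, X_j \} = E(Y X_j)
   = E\{ b'(X^T\beta^*)\, X_j \}.

% If \beta_j^M = 0, the second equation becomes
b'(\beta_{j,0}^M)\, E X_j = 0 = E\{ b'(X^T\beta^*)\, X_j \}
   = \mathrm{cov}\big(b'(X^T\beta^*), X_j\big),
% where the last equality uses E X_j = 0.

% Conversely, if the covariance is zero, the pair (\beta_{j,0}, 0) with
% b'(\beta_{j,0}) = E Y solves both score equations, and uniqueness of the
% minimizer of the convex marginal objective gives \beta_j^M = 0.
```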

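By using the fact that
$X^T\beta^* = \beta_0^* + \sum_{j \in \mathcal{M}_*} X_j \beta_j^*$, we can
easily show the following corollary.

Corollary 1. If the partial orthogonality condition holds, i.e.,
$\{X_j, j \notin \mathcal{M}_*\}$ is independent of
$\{X_i, i \in \mathcal{M}_*\}$, then $\beta_j^M = 0$, for
$j \notin \mathcal{M}_*$.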

This partial orthogonality condition is essentially the assumption made in
Huang et al. (2008), who showed the model selection consistency in the
special case with the ordinary linear model and bridge regression. Note
that $\mathrm{cov}(b'(X^T\beta^*), X_j) = \mathrm{cov}(Y, X_j)$. A
necessary condition for sure screening is that the important variables
$X_j$ with $\beta_j^* \ne 0$ are correlated with the response, which
usually holds. When they are correlated with the response, by Theorem 2,
$\beta_j^M \ne 0$ for $j \in \mathcal{M}_*$. In other words, the marginal
model retains the information about the important variables in the joint
model. This is the theoretical basis for the sure independence screening.
On the other hand, if the partial orthogonality condition in Corollary 1
holds, then $\beta_j^M = 0$ for $j \notin \mathcal{M}_*$. In this case,
there exists a threshold $\gamma_n$ such that the marginally selected model
is model selection consistent:

$\min_{j \in \mathcal{M}_*} |\beta_j^M| \ge \gamma_n, \qquad \max_{j \notin \mathcal{M}_*} |\beta_j^M| = 0.$
To have a sure screening property based on the sample version (3), we need

$\min_{j \in \mathcal{M}_*} |\beta_j^M| \ge O(n^{-\kappa}),$

for some $\kappa < 1/2$ so that the marginal signals are stronger than the
stochastic noise. The following theorem shows that this is possible.

Theorem 3. If $|\mathrm{cov}(b'(X^T\beta^*), X_j)| \ge c_1 n^{-\kappa}$ for
$j \in \mathcal{M}_*$ and a positive constant $c_1 > 0$, then there exists
a positive constant $c_2$ such that

$\min_{j \in \mathcal{M}_*} |\beta_j^M| \ge c_2 n^{-\kappa},$

provided that $b''(\cdot)$ is bounded or

$E\, G(a|X_j|)\,|X_j|\, I(|X_j| \ge n^{\eta}) \le d\, n^{-\kappa},$ for some $0 < \eta < \kappa$,

and some sufficiently small positive constants $a$ and $d$, where
$G(|x|) = \sup_{|u| \le |x|} |b'(u)|$.

Note that for the normal and Bernoulli distributions, $b''(\cdot)$ is
bounded, whereas for the Poisson distribution, $G(|x|) = \exp(|x|)$ and
Theorem 3 requires the tails of $X_j$ to be light. Under some additional
conditions, we will show in the proof of Theorem 5 that

$\sum_{j=1}^{p_n} |\beta_j^M|^2 = O(\|\Sigma\beta^*\|^2) = O(\lambda_{\max}(\Sigma)),$

where $\Sigma = \mathrm{var}(X)$, and $\lambda_{\max}(\Sigma)$ is its
maximum eigenvalue. The first equality requires some effort to prove,
whereas the second equality follows easily from the assumption

$\mathrm{var}(X^T\beta^*) = \beta^{*T}\Sigma\beta^* = O(1).$

The implication of this result is that there can not be too many variables
whose marginal coefficient $|\beta_j^M|$ exceeds a given thresholding
level. This yields the sparsity of the final selected model.

When the covariates are jointly normally distributed, the condition of
Theorem 3 can be further simplified.

Proposition 1. Suppose that $X$ and $Z$ are jointly normal with mean zero
and standard deviation 1. For a strictly monotonic function $f(\cdot)$,
$\mathrm{cov}(X, Z) = 0$ if and only if $\mathrm{cov}(X, f(Z)) = 0$,
provided the latter covariance exists. In addition,

$|\mathrm{cov}(X, f(Z))| \ge |\rho| \inf_{|x| \le c|\rho|} |g'(x)|\, E X^2 I(|X| \le c),$

for any $c > 0$, where $\rho = EXZ$ and $g(x) = E f(x + \varepsilon)$ with
$\varepsilon \sim N(0, 1 - \rho^2)$.

The above proposition shows that the covariance of $X$ and $f(Z)$ can be
bounded from below by the covariance between $X$ and $Z$, namely

$|\mathrm{cov}(X, f(Z))| \ge d|\rho|, \qquad d = \inf_{|x| \le c} |g'(x)|\, E X^2 I(|X| \le c),$

in which $d > 0$ for a sufficiently small $c$. The first part of the
proposition actually holds when the conditional density $f(z|x)$ of $Z$
given $X$ is a monotonic likelihood family (Bickel and Doksum, 2001) when
$x$ is regarded as a parameter. By taking $Z = X^T\beta^*$, a direct
application of Theorem 2 is that $\beta_j^M = 0$ if and only if

$\mathrm{cov}(X^T\beta^*, X_j) = 0,$

provided that $X$ is jointly normal, since $b'(\cdot)$ is an increasing
function. Furthermore, if

(4)  $|\mathrm{cov}(X^T\beta^*, X_j)| \ge c_0 n^{-\kappa}, \quad \kappa < 1/2,$

for some positive constant $c_0$, a minimum condition required even for the
least-squares model (Fan and Lv, 2008), then by the second part of
Proposition 1, we have

$|\mathrm{cov}(b'(X^T\beta^*), X_j)| \ge c_1 n^{-\kappa},$

for some constant $c_1$. Therefore, by Theorem 3, there exists a positive
constant $c_2$ such that

$|\beta_j^M| \ge c_2 n^{-\kappa}.$

In other words, (4) suffices to have marginal signals that are above the
maximum noise level.

5.2. Uniform convergence and sure screening. To establish the SIS
property of the MMLE, a key point is to establish the uniform convergence
of the MMLEs, that is, to control the maximum noise level relative to the
signal. Next we establish the uniform convergence rate for the MMLEs and
the sure screening property of the method in (3). The former will be useful
in controlling the size of the selected set.

Let $\boldsymbol\beta_j = (\beta_{j,0}, \beta_j)^T$ denote the
two-dimensional parameter and $\mathbf{X}_j = (1, X_j)^T$. Due to the
concavity of the log-likelihood in GLIM with the canonical link,
$E\, l(\mathbf{X}_j^T\boldsymbol\beta_j, Y)$ has a unique minimum over
$\boldsymbol\beta_j \in \mathcal{B}$ at an interior point
$\boldsymbol\beta_j^M = (\beta_{j,0}^M, \beta_j^M)^T$, where
$\mathcal{B} = \{|\beta_{j,0}^M| \le B, |\beta_j^M| \le B\}$ is a square
with the width $B$ over which the marginal likelihood is maximized. The
following is an updated version of Conditions A, B and C for each marginal
regression and two additional conditions for the covariates and the
population parameters:

A'. The marginal Fisher information
$I_j(\boldsymbol\beta_j) = E\{b''(\mathbf{X}_j^T\boldsymbol\beta_j)\mathbf{X}_j\mathbf{X}_j^T\}$
is finite and positive definite at
$\boldsymbol\beta_j = \boldsymbol\beta_j^M$, for $j = 1, \ldots, p_n$.
Moreover, $\|I_j(\boldsymbol\beta_j)\|_{\mathcal{B}}$ is bounded from above.

B'. The second derivative of $b(\theta)$ is continuous and positive. There
exists an $\varepsilon_1 > 0$ such that for all $j = 1, \ldots, p_n$,

$\sup_{\boldsymbol\beta \in \mathcal{B},\, \|\boldsymbol\beta - \boldsymbol\beta_j^M\| \le \varepsilon_1} |E\, b(\mathbf{X}_j^T\boldsymbol\beta)\, I(|X_j| > K_n)| \le o(n^{-1}).$

C'. For all $\boldsymbol\beta_j \in \mathcal{B}$, we have
$E(l(\mathbf{X}_j^T\boldsymbol\beta_j, Y) - l(\mathbf{X}_j^T\boldsymbol\beta_j^M, Y)) \ge V\|\boldsymbol\beta_j - \boldsymbol\beta_j^M\|^2$,
for some positive $V$, bounded from below uniformly over
$j = 1, \ldots, p_n$.

D. There exist some positive constants $m_0$, $m_1$, $s_0$, $s_1$ and
$\alpha$, such that for sufficiently large $t$,

$P(|X_j| > t) \le (m_1 - s_1)\exp\{-m_0 t^{\alpha}\},$ for $j = 1, \ldots, p_n$,

and that

$E\exp(b(X^T\beta^* + s_0) - b(X^T\beta^*)) + E\exp(b(X^T\beta^* - s_0) - b(X^T\beta^*)) \le s_1.$

E. The conditions in Theorem 3 hold.

Conditions A'-C' are satisfied in many examples of generalized linear
models, such as linear regression, logistic regression and Poisson
regression. Note that the second part of Condition D ensures the tail of
the response variable $Y$ to be exponentially light, as shown in the
following lemma:

Lemma 1. If Condition D holds, for any $t > 0$,

$P(|Y| \ge m_0 t^{\alpha}/s_0) \le s_1 \exp(-m_0 t^{\alpha}).$

Let $k_n = b'(K_n B + B) + m_0 K_n^{\alpha}/s_0$. Then Condition B holds
for the exponential family (1) with $K_n^* = m_0 K_n^{\alpha}/s_0$. The
Lipschitz constant $k_n$ is bounded for the logistic regression, since $Y$
and $b'(\cdot)$ are bounded. The following theorem gives a uniform
convergence result of MMLEs and a sure screening property. Interestingly,
the sure screening property does not directly depend on the property of the
covariance matrix of the covariates such as the growth of its operator
norm. This is an advantage over using the full likelihood.

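Theorem 4. Suppose that Conditions A', B', C' and D hold.

(i) If $n^{1-2\kappa}/(k_n^2 K_n^2) \to \infty$, then for any $c_3 > 0$,
there exists a positive constant $c_4$ such that

$P\Big(\max_{1 \le j \le p_n} |\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) \le p_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}.$

(ii) If, in addition, Condition E holds, then by taking
$\gamma_n = c_5 n^{-\kappa}$ with $c_5 \le c_2/2$, we have

$P\big(\mathcal{M}_* \subset \widehat{\mathcal{M}}_{\gamma_n}\big) \ge 1 - s_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\},$

where $s_n = |\mathcal{M}_*|$, the size of non-sparse elements.

Remark 1. If we assume that
$\min_{j \in \mathcal{M}_*} |\mathrm{cov}(b'(X^T\beta^*), X_j)| \ge c_1 n^{-\kappa+\delta}$
for any $\delta > 0$, then one can take
$\gamma_n = c\, n^{-\kappa+\delta/2}$ for any $c > 0$ in Theorem 4. This is
essentially the thresholding used in Fan and Lv (2008).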

Note that when $b'(\cdot)$ is bounded as in the Bernoulli model, $k_n$ is a
finite constant. In this case, by balancing the two terms in the upper
bound of Theorem 4(i), the optimal order of $K_n$ is given by

$K_n = n^{(1-2\kappa)/(\alpha+2)}$

and

$P\Big(\max_{1 \le j \le p_n} |\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) = O\{p_n \exp(-c_4 n^{(1-2\kappa)\alpha/(\alpha+2)})\},$

for a positive constant $c_4$. When the covariates $X_j$ are bounded, then
$k_n$ and $K_n$ can be taken as finite constants. In this case,

$P\Big(\max_{1 \le j \le p_n} |\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) \le O\{p_n \exp(-c_4 n^{1-2\kappa})\}.$

In both aforementioned cases, the tail probability in Theorem 4 is
exponentially small. In other words, we can handle the NP-dimensionality:

$\log p_n = o(n^{(1-2\kappa)\alpha/(\alpha+2)}),$

with $\alpha = \infty$ for the case of bounded covariates.
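
The balancing step behind this rate can be written out explicitly. The display below is a sketch for the case of constant $k_n$, with constants absorbed into $c_4$; it restates the argument of the preceding paragraph rather than adding a new result.

```latex
% Two exponents in the bound of Theorem 4(i) when k_n is constant:
%   n^{1-2\kappa} / K_n^2   (decreasing in K_n)   and   K_n^{\alpha}   (increasing in K_n).
% Equating them:
\frac{n^{1-2\kappa}}{K_n^{2}} \asymp K_n^{\alpha}
\quad\Longleftrightarrow\quad
K_n \asymp n^{(1-2\kappa)/(\alpha+2)},
\qquad\text{so both exponents are of order } n^{(1-2\kappa)\alpha/(\alpha+2)}.

% The union bound over p_n marginal fits therefore vanishes when
\log p_n = o\big(n^{(1-2\kappa)\alpha/(\alpha+2)}\big),
% and letting \alpha \to \infty recovers the bounded-covariate rate n^{1-2\kappa}.
```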

For the ordinary linear model, $k_n = B(K_n + 1) + K_n^{\alpha}/(2 s_0)$
and by taking the optimal order of $K_n = n^{(1-2\kappa)/A}$ with
$A = \max(\alpha + 4, 3\alpha + 2)$, we have

$P\Big(\max_{1 \le j \le p_n} |\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) = O\{p_n \exp(-c_4 n^{(1-2\kappa)\alpha/A})\}.$

When the covariates are normal, $\alpha = 2$ and our result is weaker than
that given in Fan and Lv (2008), who permit
$\log p_n = o(n^{1-2\kappa})$, whereas Theorem 4 can only handle
$\log p_n = o(n^{(1-2\kappa)/4})$. However, we allow non-normal covariates
and other error distributions.
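
The exponents quoted above are easy to tabulate. The helper functions below are hypothetical names introduced for illustration; they simply evaluate the formulas $(1-2\kappa)\alpha/(\alpha+2)$ and $(1-2\kappa)\alpha/A$ stated in the text, so that the gap to the rate $1 - 2\kappa$ of Fan and Lv (2008) can be inspected for particular $\kappa$ and $\alpha$.

```python
def rate_exponent_bounded_mean(kappa, alpha):
    """Exponent e in log p_n = o(n^e) when b' is bounded (k_n constant)."""
    return (1 - 2 * kappa) * alpha / (alpha + 2)

def rate_exponent_linear(kappa, alpha):
    """Exponent for the ordinary linear model, A = max(alpha + 4, 3*alpha + 2)."""
    A = max(alpha + 4, 3 * alpha + 2)
    return (1 - 2 * kappa) * alpha / A

# Normal covariates correspond to alpha = 2, where A = 8 and the linear-model
# exponent reduces to (1 - 2*kappa) / 4.
kappa = 0.25
linear_exp = rate_exponent_linear(kappa, alpha=2)
bounded_exp = rate_exponent_bounded_mean(kappa, alpha=2)
```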

The above discussion applies to the sure screening property given in
Theorem 4(ii). It is only the size of non-sparse elements $s_n$ that
matters for the purpose of sure screening, not the dimensionality $p_n$.

5.3. Controlling false selection rates. After applying the variable
screening procedure, the question arises naturally how large the set
$\widehat{\mathcal{M}}_{\gamma_n}$ is. In other words, has the number of
variables been actually reduced by the independence learning? In this
section, we aim to answer this question.

A simple answer to this question is the ideal case in which

$\mathrm{cov}(b'(X^T\beta^*), X_j) = o(n^{-\kappa}),$ for $j \notin \mathcal{M}_*$.

In this case, under some mild conditions, we can show (see the proof of
Theorem 3) that

$\max_{j \notin \mathcal{M}_*} |\beta_j^M| = o(n^{-\kappa}).$

This, together with Theorem 4(i), shows that

$\max_{j \notin \mathcal{M}_*} |\hat\beta_j^M| \le c_3 n^{-\kappa},$ for any $c_3 > 0$,

with probability tending to one if the probability in Theorem 4(i) tends to
zero. Hence, by the choice of $\gamma_n$ as in Theorem 4(ii), we can
achieve model selection consistency:

$P(\widehat{\mathcal{M}}_{\gamma_n} = \mathcal{M}_*) = 1 - o(1).$

This kind of condition was indeed implied by the condition in
Huang et al. (2008) in the special case of the ordinary linear model using
the bridge regression, who draw a similar conclusion.

We now deal with the more general case. The idea is to bound the size of
the selected set (3) by using the fact that $\mathrm{var}(Y)$ is bounded.
This usually implies
$\mathrm{var}(X^T\beta^*) = \beta^{*T}\Sigma\beta^* = O(1)$. We need the
following additional conditions:

F. The variance $\mathrm{var}(X^T\beta^*)$ is bounded from above and below.

G. Either $b''(\cdot)$ is bounded or $X_M = (X_1, \ldots, X_{p_n})^T$
follows an elliptically contoured distribution, i.e.,

$X_M = \Sigma_1^{1/2} R U,$

and $|E\, b'(X^T\beta^*)(X^T\beta^* - \beta_0^*)|$ is bounded, where $U$ is
uniformly distributed on the unit sphere in $p$-dimensional Euclidean
space, independent of the nonnegative random variable $R$, and
$\Sigma_1 = \mathrm{var}(X_M)$.

Note that $\Sigma = \mathrm{diag}(0, \Sigma_1)$ in Condition G, since the
covariance matrices differ only in the intercept term. Hence,
$\lambda_{\max}(\Sigma) = \lambda_{\max}(\Sigma_1)$. The following result
is about the size of $\widehat{\mathcal{M}}_{\gamma_n}$.
Theorem 5. Under Conditions A', B', C', D, F and G, we have for any
$\gamma_n = c_5 n^{-2\kappa}$, there exists a $c_4$ such that

$P\big[|\widehat{\mathcal{M}}_{\gamma_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}\big] \ge 1 - p_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}.$

The right hand side probability has been explained in Section 5.2. From
the proof of Theorem 5, we actually show that the number of selected
variables is of order $\|\Sigma\beta^*\|^2/\gamma_n^2$, which is further
bounded by $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$ using
$\mathrm{var}(X^T\beta^*) = O(1)$. Interestingly, while the sure screening
property does not depend on the behavior of $\Sigma$, the number of
selected variables is affected by how correlated the covariates are. When
$n^{2\kappa}\lambda_{\max}(\Sigma)/p \to 0$, the number of selected
variables is indeed negligible compared to the original size. In this case,
the percentage of falsely discovered variables is of course negligible. In
particular, when $\lambda_{\max}(\Sigma) = O(n^{\tau})$, the size of the
selected variable set is of order $O(n^{2\kappa+\tau})$. This is of the
same order as in Fan and Lv (2008) for the multiple regression model with
Gaussian data, who need the additional condition that
$2\kappa + \tau < 1$. Our result is an extension of Fan and Lv (2008) even
in this very specific case without the condition $2\kappa + \tau < 1$. In
addition, our result is more intuitive: the number of selected variables is
related to $\lambda_{\max}(\Sigma)$, or more precisely
$\|\Sigma\beta^*\|^2$, and the thresholding parameter $\gamma_n$.
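
The counting argument behind this bound can be illustrated numerically. The sketch below is an illustration only: it uses $v = \Sigma\beta^*$ as a stand-in for the vector of marginal coefficients (which is exact for the linear model with standardized covariates, and only a proxy in general), and the equicorrelated $\Sigma$, the value of `rho`, and the threshold `gamma` are arbitrary assumptions. It checks that the number of entries above the threshold is at most $\|v\|^2/\gamma^2$, which is in turn at most $\lambda_{\max}(\Sigma)\,\beta^{*T}\Sigma\beta^*/\gamma^2$.

```python
import numpy as np

p, rho = 200, 0.3
Sigma = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # equicorrelated covariates
beta = np.zeros(p)
beta[:5] = 1.0
beta /= np.sqrt(beta @ Sigma @ beta)                    # so that beta' Sigma beta = 1

v = Sigma @ beta                                        # proxy for marginal signals
gamma = 0.2
count = int(np.sum(np.abs(v) >= gamma))                 # entries above the threshold
markov_bound = float(v @ v) / gamma**2                  # ||Sigma beta||^2 / gamma^2
lam_max = float(np.linalg.eigvalsh(Sigma)[-1])          # largest eigenvalue of Sigma
eig_bound = lam_max * float(beta @ Sigma @ beta) / gamma**2
```

Increasing `rho` inflates `lam_max` and loosens both bounds, which mirrors the remark that the size of the selected set depends on how correlated the covariates are.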

6. A likelihood ratio screening. In a similar variable screening prob- 
lem with generalized linear models. Fan et al. (2008) suggest to screen the 
variables by sorting the marginal likelihood. This method can be viewed as 
a marginal likelihood ratio screening, as it builds on the increments of the 
log- likelihood. In this section we show that the likelihood ratio screening is 
equivalent to the MMLE screening in the sense that they both possess the 
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sure screening property and that the number of selected variables of the two 
methods are of the same order of magnitude. 

We first formulate the marginal likelihood screening procedure. Let

$$ L_{j,n} = \mathbb{P}_n\{ l(\hat\beta_0^M, Y) - l(\mathbf{X}_j^T \hat{\boldsymbol\beta}_j^M, Y) \}, \quad j = 1, \ldots, p_n, $$

and $\mathbf{L}_n = (L_{1,n}, \ldots, L_{p_n,n})^T$, where $\hat\beta_0^M = \arg\min_{\beta_0} \mathbb{P}_n l(\beta_0, Y)$. Correspondingly, let

$$ L_j^\star = E\{ l(\beta_0^M, Y) - l(\mathbf{X}_j^T \boldsymbol\beta_j^M, Y) \}, \quad j = 1, \ldots, p_n, $$

and $\mathbf{L}^\star = (L_1^\star, \ldots, L_{p_n}^\star)^T$, where $\beta_0^M = \arg\min_{\beta_0} E\, l(\beta_0, Y)$. It can be shown
that $EY = b'(\beta_0^M)$ and that $\bar Y = b'(\hat\beta_0^M)$, where $\bar Y$ is the sample average.
We sort the vector $\mathbf{L}_n$ in descending order and select a set of variables

$$ \widehat{\mathcal N}_{\nu_n} = \{ 1 \le j \le p_n : L_{j,n} \ge \nu_n \}, $$

where $\nu_n$ is a predefined threshold value. Such an independence learning
ranks the importance of features according to their marginal contributions to
the magnitudes of the likelihood function. The marginal likelihood screening
and the MMLE screening share a common computation procedure: solving
$p_n$ optimization problems over a two-dimensional parameter space. Hence
the computation is much more feasible than for traditional variable selection
methods.
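The procedure above can be sketched for the logistic model as follows. This is a minimal illustration, not the authors' implementation: it assumes the canonical logistic loss $l(\theta, y) = \log(1 + e^{\theta}) - y\theta$, fits each two-parameter marginal model by plain Newton-Raphson, and uses hypothetical function names (`logistic_loss`, `fit_marginal_logistic`, `marginal_lr_screen`) introduced only for this example.

```python
import numpy as np

def logistic_loss(eta, y):
    """Average loss l(theta, y) = b(theta) - y*theta with b(theta) = log(1 + e^theta)."""
    return float(np.mean(np.logaddexp(0.0, eta) - y * eta))

def fit_marginal_logistic(x, y, n_iter=25, tol=1e-10):
    """Newton-Raphson fit of the marginal model with intercept and one covariate."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        grad = Z.T @ (mu - y) / len(y)
        w = mu * (1.0 - mu)
        # Tiny ridge term (an assumption of this sketch) keeps the solve stable.
        hess = Z.T @ (Z * w[:, None]) / len(y) + 1e-10 * np.eye(2)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def marginal_lr_screen(X, y, nu):
    """Return the increments L_{j,n} and the indices j with L_{j,n} >= nu."""
    n, p = X.shape
    ybar = y.mean()                       # must lie strictly between 0 and 1
    beta0 = np.log(ybar / (1.0 - ybar))   # intercept-only fit: ybar = b'(beta0)
    null_loss = logistic_loss(np.full(n, beta0), y)
    L = np.empty(p)
    for j in range(p):
        b = fit_marginal_logistic(X[:, j], y)
        L[j] = null_loss - logistic_loss(b[0] + b[1] * X[:, j], y)
    return L, np.flatnonzero(L >= nu)
```

Each of the $p_n$ fits touches only two parameters, which is why the total cost grows linearly in the number of covariates.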

Compared with MMLE screening, where the information utilized is only
the magnitudes of the estimators, the marginal likelihood screening
incorporates the whole contribution of the features to the likelihood increments:
both the magnitudes of the estimators and their associated variation. Under
the current condition (Condition C), the variances of the MMLEs are at a
comparable level (through the magnitude of $V$, an implication of the convexity
of the objective functions), and the two screening methods are equivalent.
Otherwise, if $V$ depends on $n$, the marginal likelihood screening can still
preserve the non-sparsity structure, while the MMLE screening may need
some corresponding adjustments, which we do not discuss in detail as they are
beyond the scope of the current paper.

Next we will show that the sure screening property holds under certain 
conditions. Similarly to the MMLE screening, we first build the theoreti- 
cal foundation of the marginal likelihood screening. That is, the marginal 
likelihood increment is also a measurement of the correlation between the 
marginal covariate and the mean response function. 

Theorem 6. For $j = 1, \ldots, p_n$, the marginal likelihood increment
$L_j^\star = 0$ if and only if $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^\star), X_j) = 0$.

As a direct consequence of Theorem 6, we obtain the following
corollary for the purpose of model selection consistency.

Corollary 2. If the partial orthogonality condition in Corollary 1
holds, then $L_j^\star = 0$ for $j \notin \mathcal{M}_\star$.

We can also strengthen the result on minimum signals as follows. On the
other hand, we also show that the total signal cannot be too large. That
is, there cannot be too many signals that exceed a certain threshold.

Theorem 7. Under the conditions in Theorem 3 and Condition
C, we have

$$ \min_{j \in \mathcal{M}_\star} |L_j^\star| \ge c_6 n^{-2\kappa}, $$

for some positive constant $c_6$, provided that $|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^\star), X_j)| \ge c_1 n^{-\kappa}$ for
$j \in \mathcal{M}_\star$. If, in addition, Conditions F and G hold, then

$$ \|\mathbf{L}^\star\| = O(\|\boldsymbol\beta^M\|^2) = O(\|\boldsymbol\Sigma\boldsymbol\beta^\star\|^2) = O(\lambda_{\max}(\boldsymbol\Sigma)). $$

The technical challenge is that the stochastic noise $\|\mathbf{L}_n - \mathbf{L}^\star\|_\infty$ is usually
of the order of $O(n^{-2\kappa} + n^{-1/2} \log p_n)$, which can be an order of magnitude
larger than the signals given in Theorem 7 unless $\kappa < 1/4$. Nevertheless,
by a different trick that utilizes the fact that ranking is invariant under a
strictly monotonic transform, we are able to demonstrate the sure independence
screening property for $\kappa < 1/2$.

Theorem 8. Suppose that Conditions A', B', C, D, E and F
hold. Then, by taking $\nu_n = c_7 n^{-2\kappa}$ for a sufficiently small $c_7 > 0$, there
exists a $c_8 > 0$ such that

$$ P(\mathcal{M}_\star \subset \widehat{\mathcal N}_{\nu_n}) \ge 1 - s_n\{\exp(-c_8 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

Similarly to the MMLE screening, we can control the size of $\widehat{\mathcal N}_{\nu_n}$ as follows.
For simplicity of the technical argument, we focus only on the case where
$b''(\cdot)$ is bounded.

Theorem 9. Under Conditions A', B', C, D, F and G, if $b''(\cdot)$ is
bounded, then we have

$$ P[|\widehat{\mathcal N}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\boldsymbol\Sigma)\}] \ge 1 - p_n\{\exp(-c_8 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

7. Numerical Results. In this section, we present several simulation
examples to evaluate the performance of the SIS procedure with generalized
linear models. It was demonstrated in Fan and Lv (2008) and Fan et al. (2008)
that independent screening is a fast but crude method of reducing the
dimensionality to a more moderate size. Some methodological extensions, including
iterative SIS (ISIS) and multi-stage procedures such as SIS-SCAD and
SIS-LASSO, can be applied to perform the final variable selection and parameter
estimation simultaneously. Extensive simulations on these procedures were
also presented in Fan et al. (2008). To avoid repetition, in this paper we
focus on the vanilla SIS, with the aim of evaluating the sure screening
property and demonstrating some factors influencing the false selection rate.
We vary the sample size from 80 to 1000 for different scenarios to gauge the
difficulties of the simulation models. The following three configurations with
p = 2000, 5000 and 40000 predictor variables are considered for generating
the covariates $\mathbf{X} = (X_1, \ldots, X_p)^T$:

S1. The covariates are generated according to

$$ X_j = \frac{\varepsilon_j + a_j \varepsilon}{\sqrt{1 + a_j^2}}, \tag{5} $$

where $\varepsilon$ and $\{\varepsilon_j\}_{j=1}^{[p/3]}$ are i.i.d. standard normal random variables,
$\{\varepsilon_j\}_{j=[p/3]+1}^{[2p/3]}$ are i.i.d. and follow a double exponential distribution
with location parameter zero and scale parameter one, and $\{\varepsilon_j\}_{j=[2p/3]+1}^{p}$
are i.i.d. and follow a mixture normal distribution with two
components $N(-1, 1)$, $N(1, 0.5)$ and equal mixture proportion. The
constants $\{a_j\}_{j=1}^{q}$ are the same and chosen such that the correlation
$\rho = \mathrm{corr}(X_i, X_j) = 0$, 0.2, 0.4, 0.6 and 0.8 among the first q
variables, and $a_j = 0$ for $j > q$. The parameter q is also a measurement of
the correlations among the covariates. We will present the numerical
results with q = 15 for this setting. The covariates are standardized to
be mean zero and variance one.

Table 1
The median of the 200 empirical maximum eigenvalues, with its robust estimate of SD in
parentheses, of the corresponding sample covariance matrices of covariates based on 200
simulations with partial combinations of n = 80, 300, 600, p = 2000, 5000, 40000 and
q = 15, 50 in the first two settings (S1 and S2).

                                                   $\rho$
p       n     Setting      0            0.2          0.4          0.6          0.8
40000   80    S1 (q=15)    549.9(1.42)  550.1(1.36)  550.1(1.31)  550.1(1.29)  550.1(1.40)
40000   80    S2 (q=50)    550.0(1.42)  550.1(1.38)  550.4(1.48)  552.9(1.77)  558.5(2.35)
40000   300   S1 (q=15)    157.3(0.36)  157.4(0.36)  157.4(0.37)  157.4(0.31)  157.7(0.41)
40000   300   S2 (q=50)    157.4(0.36)  157.5(0.37)  160.9(1.22)  168.2(1.03)  176.9(1.01)
5000    300   S1 (q=15)    25.68(0.15)  25.68(0.17)  26.18(0.24)  27.99(0.39)  30.28(0.37)
5000    300   S1 (q=50)    25.69(0.14)  29.06(0.48)  37.98(0.73)  47.49(0.72)  57.17(0.45)
2000    600   S1 (q=15)    7.92(0.07)   8.32(0.15)   10.5(0.26)   13.09(0.25)  15.79(0.20)
2000    600   S1 (q=50)    7.93(0.07)   14.62(0.40)  23.95(0.65)  33.90(0.60)  43.56(0.45)
2000    600   S2 (q=50)    7.93(0.07)   14.62(0.40)  23.95(0.65)  33.90(0.60)  43.56(0.45)


S2. The covariates are also generated from (5), except that $\{a_j\}_{j=1}^{q}$ are
i.i.d. normal random variables with mean $a$ and variance 1 and $a_j = 0$
for $j > q$. The value of $a$ is taken such that $E\,\mathrm{corr}(X_i, X_j) =
0$, 0.2, 0.4, 0.6 and 0.8 among the first q variables. The simulation
results to be presented for this setting use q = 50.

S3. Let $\{X_j\}_{j=1}^{p-50}$ be i.i.d. standard normal random variables and

$$ X_k = \sum_{j=1}^{s} X_j (-1)^{j+1}/5 + \frac{\sqrt{25 - s}}{5}\,\varepsilon_k, \quad k = p - 49, \ldots, p, $$

where $\{\varepsilon_k\}_{k=p-49}^{p}$ are standard normally distributed.
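The covariate designs S1 and S3 can be sketched in code as follows. This is an illustrative sketch under stated assumptions, not the authors' simulation code: it assumes equation (5) has the form $X_j = (\varepsilon_j + a_j\varepsilon)/\sqrt{1 + a_j^2}$ with common constant $a = \sqrt{\rho/(1-\rho)}$ (so that the correlation among the first q variables is $a^2/(1+a^2) = \rho$ when $q \le [p/3]$), it reads the 0.5 in $N(1, 0.5)$ as a variance, and the function names `generate_s1` and `generate_s3` are hypothetical.

```python
import numpy as np

def generate_s1(n, p, q, rho, rng):
    """Setting S1: equicorrelated first q covariates with mixed noise laws."""
    a = np.zeros(p)
    a[:q] = np.sqrt(rho / (1.0 - rho))         # requires 0 <= rho < 1
    k1, k2 = p // 3, (2 * p) // 3
    eps = np.empty((n, p))
    eps[:, :k1] = rng.standard_normal((n, k1))
    eps[:, k1:k2] = rng.laplace(0.0, 1.0, size=(n, k2 - k1))
    m = p - k2
    first = rng.random((n, m)) < 0.5           # equal mixture proportion
    eps[:, k2:] = np.where(first,
                           rng.normal(-1.0, 1.0, (n, m)),
                           rng.normal(1.0, np.sqrt(0.5), (n, m)))
    common = rng.standard_normal((n, 1))       # the shared factor epsilon
    X = (eps + a * common) / np.sqrt(1.0 + a ** 2)
    # Standardize each column to sample mean zero and variance one.
    return (X - X.mean(axis=0)) / X.std(axis=0)

def generate_s3(n, p, s, rng):
    """Setting S3: the last 50 covariates depend on the first s ones."""
    X = rng.standard_normal((n, p))
    signs = (-1.0) ** np.arange(s)             # (-1)^(j+1) for j = 1, ..., s
    signal = X[:, :s] @ signs / 5.0
    noise = rng.standard_normal((n, 50))
    X[:, p - 50:] = signal[:, None] + np.sqrt(25.0 - s) / 5.0 * noise
    return X
```

In S3 the variance of each $X_k$ is $s/25 + (25 - s)/25 = 1$, so the last 50 covariates are on the same scale as the others.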

Table 1 summarizes the median of the empirical maximum eigenvalues
of the covariance matrix and its robust estimate of the standard deviation
(RSD) based on 200 simulations in the first two settings (S1 and S2) with
partial combinations of sample size n = 80, 300, 600, p = 2000, 5000, 40000 and
q = 15, 50. RSD is the interquartile range (IQR) divided by 1.34. The
empirical maximum eigenvalues are always larger than their population version,
depending on the realizations of the design matrix. The empirical minimum
eigenvalue is always zero, and the empirical condition numbers for the sample
covariance matrix are infinite, since p > n. Generally, the empirical
maximum eigenvalues increase as the correlation parameters $\rho$ and q or the number
of covariates p increase, and/or the sample size n decreases.
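The two summaries used in Table 1 can be computed as in the sketch below. It is an illustration only: the function names are hypothetical, and the sample covariance is normalized by n - 1, which is an assumption (the text does not state the normalization). When p > n, the nonzero eigenvalues of the p x p sample covariance equal those of the much smaller n x n Gram matrix, which is what makes p = 40000 tractable.

```python
import numpy as np

def max_eigenvalue_sample_cov(X):
    """Largest eigenvalue of the sample covariance matrix of the columns of X."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    if p > n:
        # n x n Gram matrix shares its nonzero eigenvalues with the covariance.
        M = Xc @ Xc.T / (n - 1)
    else:
        M = Xc.T @ Xc / (n - 1)
    return float(np.linalg.eigvalsh(M)[-1])   # eigvalsh sorts ascending

def robust_sd(values):
    """Robust SD estimate: interquartile range divided by 1.34."""
    q25, q75 = np.percentile(values, [25, 75])
    return float((q75 - q25) / 1.34)
```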

With these three settings, we aim to illustrate the behaviors of two SIS
procedures under different correlation structures. For each simulation and
each model, we apply the two SIS procedures, the marginal MLE and the
marginal likelihood ratio methods, to screen variables. The minimum model
size (MMS) required for each method to have a sure screening, i.e. to contain
the true model $\mathcal{M}_\star$, is used as a measure of the effectiveness of a
screening method. This avoids the issue of choosing the thresholding parameter.
To gauge the difficulty of the problem, we also include the LASSO and the
SCAD as references for comparison when p = 2000 and 5000. The smaller p
is used due to the computational burden of the LASSO and the SCAD. In addition,
as demonstrated in our simulation results, they do not perform well when p
is large. Our initial intention is to demonstrate that the simple SIS does not
perform much worse than far more complicated procedures like the LASSO
and the SCAD. To our surprise, the SIS can even outperform those more
complicated methods in terms of variable screening. Again, we record the MMS for
the LASSO and the SCAD for each simulation and each model, which does
not depend on the choice of regularization parameters. When the LASSO or
the SCAD cannot recover the true model even with the smallest
regularization parameter, we average the model size with the smallest regularization
parameter and p. These interpolated MMS are presented in italic font in
Tables 2 and 3 to distinguish them from the real MMS. Results for logistic regressions
and linear regressions are presented in the following two subsections.

7.1. Logistic Regressions. The generated data $(\mathbf{X}_1^T, Y_1), \ldots, (\mathbf{X}_n^T, Y_n)$ are
n i.i.d. copies of a pair $(\mathbf{X}^T, Y)$, in which the conditional distribution of the
response Y given $\mathbf{X} = \mathbf{x}$ is binomial with probability of success
$p(\mathbf{x}) = \exp(\mathbf{x}^T\boldsymbol\beta^\star)/[1 + \exp(\mathbf{x}^T\boldsymbol\beta^\star)]$. We vary the size of the nonsparse set
of coefficients as s = 3, 6, 12, 15 and 24. For each simulation, we evaluate
each method by summarizing the median minimum model size (MMMS) of
the selected models as well as its associated RSD, which is the associated
interquartile range (IQR) divided by 1.34. The results, based on 200
simulations for each scenario, are recorded in the second and third panels of Tables
2 and 3. Specifically, Table 2 records the MMMS and the associated RSD
for SIS when p = 40000, while Table 3 records these results for SIS, the
LASSO and the SCAD when p = 2000 and 5000. The true parameters are
also recorded in Tables 2 and 3.
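The MMS of a single simulation and the MMMS/RSD summary over repeated simulations can be sketched as follows. This is a minimal illustration with hypothetical function names; `scores` stands for any marginal utility to be ranked (for example $|\hat\beta_j^M|$ or $L_{j,n}$), with larger values ranked first and ties broken by index, which is an assumption of this sketch.

```python
import numpy as np

def minimum_model_size(scores, true_idx):
    """Smallest k such that the top-k ranked variables contain the true model."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")      # best score first
    ranks = np.empty(scores.size, dtype=int)
    ranks[order] = np.arange(1, scores.size + 1)    # rank 1 = largest score
    return int(ranks[np.asarray(true_idx)].max())

def summarize_mms(mms_values):
    """Median of the MMS values and the robust SD, IQR / 1.34."""
    mms_values = np.asarray(mms_values, dtype=float)
    q25, q75 = np.percentile(mms_values, [25, 75])
    return float(np.median(mms_values)), float((q75 - q25) / 1.34)
```

For instance, with scores (0.9, 0.1, 0.5, 0.7) and true variables at positions 0 and 2, the worst-ranked true variable has rank 3, so the MMS is 3.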

To demonstrate the difficulty of our simulated models, we depict the
distribution, among 200 simulations, of the minimum |t|-statistics of s
estimated regression coefficients in the oracle model in which the statistician
does not know that all variables are significant. This shows the difficulty of
recovering all significant variables even in the oracle model with the minimum
model size s. The distribution was computed for each setting and scenario,
but only a few selected settings are presented in Figure 1. In fact,
the distributions under setting 1 are very similar to those under setting 2
when the same q value is taken. It can be seen that the magnitude of the
minimum |t|-statistics is reasonably small and gets smaller as the
correlation within covariates (measured by $\rho$ and q) increases, sometimes achieving
three decimals. Given such small signal-to-noise ratios in the oracle models,
the difficulty of our simulation models is self-evident even if the signals
seem not that small.

Fig 1. The boxplots of the minimum |t|-statistics in the oracle models among 200
simulations for the first setting (S1) with logistic regression examples with $\boldsymbol\beta^\star = (3, 4, \ldots)^T$ when
s = 12, 24, q = 15, 50, n = 600 and p = 2000. The triplets under each plot represent the
corresponding values of $(s, q, \rho)$ respectively.

The MMMS and RSD with fixed correlation (S1) and random
correlation (S2) are comparable under the same q. As the correlation increases
and/or the nonsparse set size increases, the MMMS and the associated RSD
usually increase for SIS, the LASSO and the SCAD alike. Among all the
designed scenarios of Settings 1 and 2, SIS performs well, while the LASSO
and the SCAD occasionally fail under very high correlations and relatively
large nonsparse set sizes (s = 12, 15 and 24). Interestingly, correlation within
covariates can sometimes help SIS reduce the false selection rate, as it can
increase the marginal signals. It is notable that the LASSO and the SCAD
usually cannot select the important variables in the third setting, due to
the violation of the irrepresentable condition for s = 6, 12 and 24, while SIS
performs reasonably well.

7.2. Linear Models. The generated data $(\mathbf{X}_1^T, Y_1), \ldots, (\mathbf{X}_n^T, Y_n)$ are n
i.i.d. copies of a pair $(\mathbf{X}^T, Y)$, in which the response Y follows a linear
model with $Y = \mathbf{X}^T\boldsymbol\beta^\star + \varepsilon$, where the random error $\varepsilon$ is standard normally
distributed. The covariates are generated in the same manner as in the
logistic regression settings. We take the same true coefficients and correlation
structures for part of the scenarios (p = 40000) as in the logistic regression
examples, while varying the true coefficients for other scenarios, to gauge the
difficulty of the problem. The sample size for each scenario is
correspondingly decreased to reflect the fact that the linear model is more informative.
The trends of the MMMS and the associated RSD of SIS, the LASSO and the
SCAD with the correlation and/or the nonsparse set size are
similar to those in the logistic regression examples, but their magnitudes are
usually smaller in the linear regression examples, as the model is more
informative. Overall, the SIS does a very reasonable job in screening irrelevant
variables and sometimes outperforms the LASSO and the SCAD.

Table 2
The MMMS (Median Minimum Model Size) and the associated RSD (in parentheses)
of the simulated examples for logistic regressions when p = 40000.

$\rho$ n     SIS-MLR      SIS-MMLE     | n     SIS-MLR      SIS-MMLE

Setting 1, q = 15
       s = 3, $\boldsymbol\beta^\star = (1, 1.3, 1)^T$     | s = 6, $\boldsymbol\beta^\star = (1, 1.3, 1, \ldots)^T$
0                         89(375)      | 300   47(164)      50(170)
0.2    200   3(0)         3(0)         | 300   6(0)         6(0)
0.4    200   3(0)         3(0)         | 300   7(1)         7(1)
0.6    200   3(1)         3(1)         | 300   8(1)         8(2)
0.8    200   4(1)         4(1)         | 300   9(3)         9(3)
       s = 12, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$   | s = 15, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$
0      500   297(589)     302.5(597)   | 600   350(607)     359.5(612)
0.2    300   13(1)        13(1)        | 300   15(0)        15(0)
0.4    300   14(1)        14(1)        | 300   15(0)        15(0)
0.6    300   14(1)        14(1)        | 300   15(0)        15(0)
0.8    300   14(1)        14(1)        | 300   15(0)        15(0)

Setting 2, q = 50
       s = 3, $\boldsymbol\beta^\star = (1, 1.3, 1)^T$     | s = 6, $\boldsymbol\beta^\star = (1, 1.3, 1, \ldots)^T$
0      300   84.5(376)    88.5(383)    | 500   6(1)         6(1)
0.2    300   3(0)         3(0)         | 500   6(0)         6(0)
0.4    300   3(0)         3(0)         | 500   6(1)         6(1)
0.6    300   3(1)         3(1)         | 500   8.5(4)       9(5)
0.8    300   5(4)         5(4)         | 500   13.5(8)      14(8)
       s = 12, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$   | s = 15, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$
0      600   77(114)      78.5(118)    | 800   46(82)       47(83)
0.2    500   18(7)        18(7)        | 500   26(6)        26(6)
0.4    500   25(8)        25(10)       | 500   34(7)        33(8)
0.6    500   32(9)        31(8)        | 500   39(7)        38(7)
0.8    500   36(8)        35(9)        | 500   40(6)        42(7)

8. Concluding Remarks. In this paper, we propose two independent
screening methods by ranking the maximum marginal likelihood
estimators and the maximum marginal likelihood in generalized linear models.
With Fan and Lv (2008) as a special case, the proposed method is shown
to possess the sure independence screening property. The success of the
marginal screening suggests that any surrogate screening, besides
the marginal utility screening introduced in this paper, can be a good option
for population variable screening, as long as it
preserves the non-sparsity structure of the true model and is computationally
feasible. It also
paves the way for sample variable screening, as long as the surrogate
signals are uniformly distinguishable from the stochastic noise. Along this
line, many statistics, such as R-square statistics and marginal pseudo-likelihood
(least-squares estimation, for example), can be a potential basis for
independence learning. Meanwhile, the proposed properties, sure screening and
vanishing false selection rate, will be good criteria for evaluating ultrahigh
dimensional variable selection methods.

As our current results only hold when the log-likelihood function is con- 
cave in the regression parameters, the proposed procedure does not cover all 
generalized linear models, such as some noncanonical link cases. This leaves 
space for future research. 

Unlike Fan and Lv (2008), the main idea of our technical proofs is broadly
applicable. We conjecture that our results should hold when the conditional
distribution of the outcome Y given the covariates $\mathbf{X}$ depends only on $\mathbf{X}^T\boldsymbol\beta^\star$
and is arbitrary and unknown otherwise. Therefore, besides GLIM, the SIS
method can be applied to a rich class of general regression models, including
transformation models (Box and Cox, 1964; Bickel and Doksum, 1981),
censored regression models (Cox, 1972; Kosorok et al., 2004; Zeng and Lin, 2007)
and projection pursuit regression (Friedman and Stuetzle, 1981). These
are also interesting future research topics.

Table 3
The MMMS (Median Minimum Model Size) and the associated RSD (in parentheses)
of the simulated examples for logistic regressions. The values in italic font indicate that
the LASSO or the SCAD cannot recover the true model even with the smallest regularization
parameter and are estimated. M-$\lambda_{\max}$ and its RSD have the same meaning as in Table 1.

$\rho$ n     SIS-MLR     SIS-MMLE    LASSO       SCAD        | n     SIS-MLR     SIS-MMLE    LASSO       SCAD

Setting 1, p = 5000, q = 15
       s = 3, $\boldsymbol\beta^\star = (1, 1.3, 1)^T$                          | s = 6, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$
0      300   3(0)                                3(1)        | 300   12.5(15)    13(16)      7(1)        6(1)
0.2    300   3(0)                                3(0)        | 300   6(0)        6(0)        6(0)        6(0)
0.4    300   3(0)                    3(0)        3(0)        | 300   6(1)        6(1)        6(1)        6(0)
0.6    300   3(0)        3(0)        3(0)        3(1)        | 300   7(2)        7(2)        7(1)        6(1)
0.8    300   3(1)                    4(1)        4(1)        | 300   9(2)        9(3)        27.5(3725)  6(0)
       s = 12, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$                        | s = 15, $\boldsymbol\beta^\star = (3, 4, \ldots)^T$
0      300   297.5(359)  300(361)    72.5(3704)  12(0)       | 300   479(622)    482(615)    69.5(68)    15(0)
0.2    300   13(1)       13(1)       12(1)       12(0)       | 300   15(0)       15(0)       16(13)      15(0)
0.4    300   14(1)       14(1)       14(1861)    13(1865)    | 300   15(0)       15(0)       38(3719)    15(3720)
0.6    300   14(1)       14(1)       2552(85)    12(3721)    | 300   15(0)       15(0)       2555(87)    15(1472)
0.8    300   14(1)       14(1)       2556(10)    12(3722)    | 300   15(0)       15(0)       2552(8)     15(1322)

Setting 2, p = 2000, q = 50
       s = 3, $\boldsymbol\beta^\star = (3, 4, 3)^T$                            | s = 6, $\boldsymbol\beta^\star = (3, -3, \ldots)^T$
0      200   3(0)        3(0)        3(0)        3(0)        | 200   8(6)        9(7)        7(1)        7(1)
0.2    200   3(0)        3(0)        3(0)        3(0)        | 200   18(38)      20(39)      9(4)        9(2)
0.4    200   3(0)        3(0)        3(0)        3(1)        | 200   51(77)      64.5(76)    20(10)      16.5(6)
0.6    200   3(1)        3(1)        3(1)        3(1)        | 300   77.5(139)   77.5(132)   20(13)      19(9)
0.8    200   5(5)        5.5(5)      6(4)        6(4)        | 400   306.5(347)  313(336)    86(40)      70.5(35)
       s = 12, $\boldsymbol\beta^\star = (3, 4, \ldots)^T$                          | s = 24, $\boldsymbol\beta^\star = (3, 4, \ldots)^T$
0      600   13(6)       13(7)       12(0)       12(0)       | 600   180(240)    182(238)    35(9)       31(10)
0.2    600   19(6)       19(6)       13(1)       13(2)       | 600   45(4)       45(4)       35(27)      32(24)
0.4    600   32(10)      30(10)      18(3)       17(4)       | 600   46(3)       47(2)       1099(17)    1093(1456)
0.6    600   38(9)       38(10)      22(3)       22(4)       | 600   48(2)       48(2)       1078(5)     1065(23)
0.8    600   38(7)       39(8)       1071(6)     1042(34)    | 600   48(1)       48(1)       1072(4)     1067(13)

Setting 3, p = 2000, $\boldsymbol\beta^\star = (1, -1, \ldots)^T$
       s = 3, M-$\lambda_{\max}$(RSD): 8.47(0.17)                       | s = 6, M-$\lambda_{\max}$(RSD): 10.36(0.26)
       600   3(0)        3(0)        3(1)        3(0)        | 600   56(0)       56(0)       1227(7)     1142(64)
       s = 12, M-$\lambda_{\max}$(RSD): 14.69(0.39)                     | s = 24, M-$\lambda_{\max}$(RSD): 23.70(0.14)
       600   63(6)       63(6)       1148(8)     1093(59)    | 600   214.5(93)   208.5(82)   1120(5)     1087(24)

Table 4
The MMMS (Median Minimum Model Size) and the associated RSD (in parentheses)
of the simulated examples for linear regressions when p = 40000.

$\rho$ n     SIS-MLR      SIS-MMLE     | n     SIS-MLR      SIS-MMLE

Setting 1, q = 15
       s = 3, $\boldsymbol\beta^\star = (1, 1.3, 1)^T$     | s = 6, $\boldsymbol\beta^\star = (1, 1.3, 1, \ldots)^T$
0      80                              |       42(157)      42(157)
0.2    80    3(0)         3(0)         | 150   6(0)         6(0)
0.4    80    3(0)         3(0)         | 150   6.5(1)       6.5(1)
0.6    80    3(0)         3(0)         | 150   6(1)         6(1)
0.8    80    3(0)         3(0)         | 150   7(1)         7(1)
       s = 12, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$   | s = 15, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$
0      300   143(282)     143(282)     | 400   135.5(167)   135.5(167)
0.2    200   13(1)        13(1)        | 200   15(0)        15(0)
0.4    200   13(1)        13(1)        | 200   15(0)        15(0)
0.6    200   13(1)        13(1)        | 200   15(0)        15(0)
0.8    200   13(1)        13(1)        | 200   15(0)        15(0)

Setting 2, q = 50
       s = 3, $\boldsymbol\beta^\star = (1, 1.3, 1)^T$     | s = 6, $\boldsymbol\beta^\star = (1, 1.3, 1, \ldots)^T$
0      100   3(2)         3(2)         | 200   7.5(7)       7.5(7)
0.2    100   3(0)         3(0)         | 200   6(1)         6(1)
0.4    100   3(0)         3(0)         | 200   7(1)         7(1)
0.6    100   3(0)         3(0)         | 200   7(2)         7(2)
0.8    100   3(1)         3(1)         | 200   8(4)         8(4)
       s = 12, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$   | s = 15, $\boldsymbol\beta^\star = (1, 1.3, \ldots)^T$
0      400   22(27)       22(27)       | 500   35(52)       35(52)
0.2    300   16(5)        16(5)        | 300   24(7)        24(7)
0.4    300   19(8)        19(8)        | 300   30(10)       30(10)
0.6    300   25(8)        25(8)        | 300   33.5(7)      33.5(7)
0.8    300   24(7)        24(7)        | 300   35(8)        35(8)

Table 5
The MMMS (Median Minimum Model Size) and the RSD (in parentheses) of the
simulated examples for linear regressions when p = 2000 and 5000. The same caption as
Table 3 is used.

$\rho$ n     SIS-MLR     SIS-MMLE    LASSO       SCAD        | n     SIS-MLR     SIS-MMLE    LASSO       SCAD

Setting 1, p = 5000, q = 15
       s = 3, $\boldsymbol\beta^\star = (0.5, 0.67, 0.5)^T$                     | s = 6, $\boldsymbol\beta^\star = (0.5, 0.67, \ldots)^T$
0      100   12(40)      12(40)      3(1)        3(1)        | 100   210.5(422)  210.5(422)  33.5(651)   25(22)
0.2    100   3(1)        3(1)        3(0)        3(0)        | 100   7(2)        7(2)        6(1)        6(1)
0.4    100   3(0)        3(0)        3(0)        3(0)        | 100   7(2)        7(2)        6(1)        6(1)
0.6    100   3(1)        3(1)        5(7)        5(5)        | 100   8(2)        8(2)        7(1)        7(1)
0.8    100   4(2)        4(2)        4(1)        4(1)        | 100   9(3)        9(3)        7(2)        8(1)
       s = 12, $\boldsymbol\beta^\star = (0.5, 0.67, \ldots)^T$                     | s = 15, $\boldsymbol\beta^\star = (0.5, 0.67, \ldots)^T$
0      300   49(76)      49(76)      12(1)       12(0)       | 300   199(251)    199(251)    17(2)       15(0)
0.2    100   14(2)       14(2)       12(1)       12(1)       | 100   17(5)       17(5)       15(1)       15(0)
0.4    100   14(1)       14(1)       12(1)       12(1)       | 100   15(0)       15(0)       15(0)       15(0)
0.6    100   14(1)       14(1)       13(1)       13(1)       | 100   15(0)       15(0)       15(0)       15(0)
0.8    100   14(1)       14(1)       13(1)       13(1)       | 100   15(0)       15(0)       15(0)       15(1)

Setting 2, p = 2000, q = 50
       s = 3, $\boldsymbol\beta^\star = (0.6, 0.8, 0.6)^T$                      | s = 6, $\boldsymbol\beta^\star$ = (3, -...)
0      100   5(14)       6(16)       4(4)        4(2)        | 100   15(43)      18(47)      6(0)        6(1)
0.2    100   3(1)        3(1)        3(0)        3(0)        | 100   42(116)     47(99)      7(1)        7(1)
0.4    100   3(1)        4(1)        3(1)        3(1)        | 100   143(207)    129(226)    12(4)       12(5)
0.6    100   5(3)        7(5)        4(1)        4(1)        | 200   47(93)      49(110)     7(1)        7(1)
0.8    100   7(7)        14(12)      5(57)       7(4)        | 200   360(470)    376.5(486)  54(32)      51(25)
       s = 12, $\boldsymbol\beta^\star = (0.6, 0.8, \ldots)^T$                      | s = 24, $\boldsymbol\beta^\star = (3, 4, \ldots)^T$
0      200   151(212)    140(207)    15(4)       15(4)       | 400   229(283)    227(279)    24(0)       25(0)
0.2    100   37.5(10)    36(12)      16(3)       16(4)       | 100   61(43)      67(46)      30(2)       30(2)
0.4    100   39(7)       40.5(8)     18(3)       17(2)       | 100   48(2)       47(2)       31(2)       30(1)
0.6    100   41(7)       42(6)       19(3)       18(3)       | 100   48(2)       49(2)       32(2)       32(3)
0.8    100   44(5)       46(6)       23(1478)    24(50)      | 100   49(2)       49(1)       32(2)       32(2)

Setting 3, p = 2000, $\boldsymbol\beta^\star = (1, -1, \ldots)^T$
       s = 3, M-$\lambda_{\max}$(RSD): 8.47(0.17)                       | s = 6, M-$\lambda_{\max}$(RSD): 10.36(0.26)
       600   3(0)        3(0)        3(0)        3(0)        | 600   56(0)       56(0)       47(4)       45(3)
       s = 12, M-$\lambda_{\max}$(RSD): 14.69(0.39)                     | s = 24, M-$\lambda_{\max}$(RSD): 23.70(0.14)
       600   62(0)       62(0)       1610(10)    1304(2)     | 600   81(19)      81(23)      1637(14)    1303(1)

Another important extension is to generalize the concept of marginal
regression to marginal group regression, where the number of covariates
m in each marginal regression is greater than or equal to one. This leads to a
new procedure called grouped variables screening. It is expected to improve
the situation when the variables are highly correlated and jointly
important, but marginally the correlation between each individual variable and
the response is weak. The current theoretical studies for the componentwise
marginal regression can be directly extended to group variable screening,
with appropriate conditions and adjustments. This leads to another
interesting topic of future research.

In practice, how to choose the tuning parameter $\gamma_n$ is an interesting and
important problem. As discussed in Fan and Lv (2008), for the first stage
of the iterative SIS procedure, our preference is to select sufficiently many
features, such that $|\widehat{\mathcal M}_{\gamma_n}| = n$ or $n/\log(n)$. The FDR-based methods in
multiple comparison can also possibly be employed. In the second or final stage,
a Bayes information type of criterion can be applied. In practice, some
data-driven methods may also be welcome for choosing the tuning parameter $\gamma_n$.
This is an interesting future research topic and is beyond the scope of the
current paper.
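The first-stage rule of keeping n or n/log(n) features can be sketched as below. This is only an illustration: the function name `top_d_features`, the `rule` argument, and the rounding of n/log(n) down to an integer are assumptions of this sketch, not choices stated in the text.

```python
import math
import numpy as np

def top_d_features(scores, n, rule="n/log(n)"):
    """Indices of the d features with the largest |score|, with d = n or n/log(n)."""
    scores = np.asarray(scores, dtype=float)
    d = n if rule == "n" else int(math.floor(n / math.log(n)))
    d = min(d, scores.size)                       # cannot keep more than p features
    order = np.argsort(-np.abs(scores), kind="stable")
    return np.sort(order[:d])
```

Fixing the model size this way is equivalent to choosing the threshold $\gamma_n$ as the d-th largest marginal utility, which avoids tuning $\gamma_n$ directly.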



9. Proofs. To establish Theorem 1, the following symmetrization
theorem in van der Vaart and Wellner (1996), contraction theorem in Ledoux
and Talagrand (1991) and concentration theorem in Massart (2000) will be
needed.

Lemma 2. (Symmetrization, Lemma 2.3.1, van der Vaart and Wellner, 1996)
Let $Z_1, \ldots, Z_n$ be independent random variables with values in $\mathcal{Z}$
and $\mathcal{F}$ a class of real valued functions on $\mathcal{Z}$. Then

$$ E\Bigl\{\sup_{f \in \mathcal{F}} |(\mathbb{P}_n - P) f(Z)|\Bigr\} \le 2E\Bigl\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon f(Z)|\Bigr\}, $$

where $\varepsilon_1, \ldots, \varepsilon_n$ is a Rademacher sequence (i.e., an i.i.d. sequence taking values
$\pm 1$ with probability 1/2) independent of $Z_1, \ldots, Z_n$ and $Pf(Z) = Ef(Z)$.

Lemma 3. (Contraction theorem, Ledoux and Talagrand, 1991) Let
$z_1, \ldots, z_n$ be nonrandom elements of some space $\mathcal{Z}$ and let $\mathcal{F}$ be a class
of real valued functions on $\mathcal{Z}$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence.
Consider Lipschitz functions $\gamma_i : \mathbb{R} \mapsto \mathbb{R}$, that is,

$$ |\gamma_i(s) - \gamma_i(\tilde s)| \le |s - \tilde s|, \quad \forall s, \tilde s \in \mathbb{R}. $$

Then for any function $\tilde f : \mathcal{Z} \mapsto \mathbb{R}$, we have

$$ E\Bigl\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon(\gamma(f) - \gamma(\tilde f))|\Bigr\} \le 2E\Bigl\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon(f - \tilde f)|\Bigr\}. $$

Lemma 4. (Concentration theorem, Massart, 2000) Let $Z_1, \ldots, Z_n$ be
independent random variables with values in some space $\mathcal{Z}$ and let $\gamma \in \Gamma$,
a class of real valued functions on $\mathcal{Z}$. We assume that for some positive
constants $l_{i,\gamma}$ and $u_{i,\gamma}$, $l_{i,\gamma} \le \gamma(Z_i) \le u_{i,\gamma}$ $\forall \gamma \in \Gamma$. Define

$$ L^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^{n} (u_{i,\gamma} - l_{i,\gamma})^2 / n, \quad \text{and} \quad Z = \sup_{\gamma \in \Gamma} |(\mathbb{P}_n - P)\gamma(Z)|, $$

then for any $t > 0$,

$$ P(Z \ge EZ + t) \le \exp\Bigl(-\frac{n t^2}{2L^2}\Bigr). $$

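Let $N > 0$, and define a set of $\boldsymbol\beta$:

$$ \mathcal{B}(N) = \{\boldsymbol\beta \in \mathcal{B}, \|\boldsymbol\beta - \boldsymbol\beta_0\| \le N\}. $$

Let

$$ G_1(N) = \sup_{\boldsymbol\beta \in \mathcal{B}(N)} |(\mathbb{P}_n - P)\{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\} I_n(\mathbf{X}, Y)|, $$

where $I_n(\mathbf{X}, Y)$ is defined in Condition B. The next result is about the upper
bound of the tail probability for $G_1(N)$ in the neighborhood $\mathcal{B}(N)$.

Lemma 5. For all $t > 0$, it holds that

$$ P(G_1(N) \ge 4N k_n (q/n)^{1/2}(1 + t)) \le \exp(-2t^2/K_n^2). $$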

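Proof of Lemma 5. The main idea is to apply the concentration theorem
(Lemma 4). To this end, we first show that the random variables involved
are bounded. By Condition B and the Cauchy-Schwarz inequality, we have
that on the set $\Omega_n$,

$$ |l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)| \le k_n|\mathbf{X}^T(\boldsymbol\beta - \boldsymbol\beta_0)| \le k_n\|\mathbf{X}\|\,\|\boldsymbol\beta - \boldsymbol\beta_0\|. $$

On the set $\Omega_n$, by the definition of $\mathcal{B}(N)$, the above random variable is
further bounded by $k_n q^{1/2} K_n N$. Hence, $L^2 = 4k_n^2 q K_n^2 N^2$, using the notation
of Lemma 4.

We need to bound the expectation $EG_1(N)$. An application of the
symmetrization theorem (Lemma 2) yields that

$$ EG_1(N) \le 2E\Bigl[\sup_{\boldsymbol\beta \in \mathcal{B}(N)} |\mathbb{P}_n \varepsilon\{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\} I_n(\mathbf{X}, Y)|\Bigr]. $$

By the contraction theorem (Lemma 3) and the Lipschitz condition in
Condition B, we can bound the right hand side of the above inequality further
by

$$ 4k_n E\Bigl\{\sup_{\boldsymbol\beta \in \mathcal{B}(N)} |\mathbb{P}_n \varepsilon \mathbf{X}^T(\boldsymbol\beta - \boldsymbol\beta_0) I_n(\mathbf{X}, Y)|\Bigr\}. \tag{6} $$

By the Cauchy-Schwarz inequality, the expectation in (6) is controlled by

$$ E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\| \sup_{\boldsymbol\beta \in \mathcal{B}(N)} \|\boldsymbol\beta - \boldsymbol\beta_0\| \le E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\|\, N. \tag{7} $$

By Jensen's inequality, the expectation in (7) is bounded above by

$$ (E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\|^2)^{1/2} = (E\|\mathbf{X}\|^2 I_n(\mathbf{X}, Y)/n)^{1/2} \le (q/n)^{1/2}, $$

by noticing that

$$ E\|\mathbf{X}\|^2 I_n(\mathbf{X}, Y) \le E\|\mathbf{X}\|^2 = E(X_1^2 + \cdots + X_q^2) = q, $$

since $EX_j^2 = 1$. Combining these results, we conclude that

$$ EG_1(N) \le 4N k_n (q/n)^{1/2}. $$

An application of the concentration theorem (Lemma 4) yields that

$$ P(G_1(N) \ge 4N k_n (q/n)^{1/2}(1 + t)) \le \exp\Bigl(-\frac{n \cdot 16 N^2 k_n^2 (q/n) t^2}{8 k_n^2 q K_n^2 N^2}\Bigr) = \exp(-2t^2/K_n^2). $$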

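This proves the lemma. □

Proof of Theorem 1. The proof takes two main steps: we first bound
$\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|$ by $G(N)$ for a small $N$, where $N$ is chosen so that Conditions B and
C hold, and then utilize Lemma 5 to conclude.

Following a similar idea in van de Geer (2002), we define a convex
combination $\boldsymbol\beta_s = s\hat{\boldsymbol\beta} + (1 - s)\boldsymbol\beta_0$ with

$$ s = (1 + \|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|/N)^{-1}. $$

Then, by definition,

$$ \|\boldsymbol\beta_s - \boldsymbol\beta_0\| = s\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \le N, $$

namely, $\boldsymbol\beta_s \in \mathcal{B}(N)$. Due to the convexity, we have

$$ \mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_s, Y) \le s\,\mathbb{P}_n l(\mathbf{X}^T\hat{\boldsymbol\beta}, Y) + (1 - s)\,\mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_0, Y) \le \mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_0, Y). \tag{8} $$

Since $\boldsymbol\beta_0$ is the minimizer, we have

$$ E[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \ge 0, $$

where $\boldsymbol\beta_s$ is regarded as a parameter in the above expectation. Hence, it follows
from (8) that

$$ E[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \le (E - \mathbb{P}_n)[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \le G(N), $$

where

$$ G(N) = \sup_{\boldsymbol\beta \in \mathcal{B}(N)} |(\mathbb{P}_n - P)\{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\}|. $$

By Condition C, it follows that

$$ \|\boldsymbol\beta_s - \boldsymbol\beta_0\| \le [G(N)/V_n]^{1/2}. \tag{9} $$

We now use (9) to conclude the result. Note that for any $x$,

$$ P(\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge x) \le P(G(N) \ge V_n x^2). $$

Setting $x = N/2$, we have

$$ P(\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge N/2) \le P\{G(N) \ge V_n N^2/4\}. $$

Using the definition of $\boldsymbol\beta_s$, the left hand side is the same as $P\{\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \ge N\}$.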
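The last identification can be checked directly from the definition of $s$. The short derivation below is a worked sketch, writing $d = \|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|$ (a shorthand introduced here for illustration only):

```latex
\begin{align*}
s &= \Bigl(1 + \frac{d}{N}\Bigr)^{-1} = \frac{N}{N + d}, \\
\|\boldsymbol\beta_s - \boldsymbol\beta_0\| &= s\,d = \frac{N d}{N + d}, \\
\frac{N d}{N + d} \ge \frac{N}{2}
  &\iff 2d \ge N + d \iff d \ge N .
\end{align*}
```

So the event $\{\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge N/2\}$ coincides with $\{\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \ge N\}$, which is what allows a bound on the shrunken point $\boldsymbol\beta_s$ to be transferred to the estimator itself.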

Now, by taking N = 4a„(l + t)/Vn with a„ = Akn^/q/n, we have 

Pm - /3oll > ^} < P{G{N) > KiVV4} 

= P{G{N)>Nan{l + t)]. 

The last probability is bounded by 

(10) P{G{N) > Nan{l + t),nn,.} + P{^n,.}, 

where n^,, = {||X,|| < K^, \Yi\ < K^}. 
On the set i7„ j., since 



sup r„ 

f3(iB(N) 



Y) - l{X.^(3o, Y) (1 - /„(X, Y)) = 0, 
by the triangular inequality, 



G(7V) <Gi(iV)+ sup E[z(x^/3,y)-^(x'^/3o,y)](i-/n(x,y)) 

I3(^B(N) 

It follows from Condition B that (jlOp is bounded by 

P{Gi{N) > Nanil + t) + o{q/n)} + nP{(X,Y) G 
The conclusion follows from Lemma [5l □ 
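The two algebraic facts used in this proof, namely that the event $\{\|\boldsymbol{\beta}_s - \boldsymbol{\beta}_0\| \ge N/2\}$ coincides with $\{\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| \ge N\}$ and that the choice of $N$ turns $V_nN^2/4$ into $Na_n(1+t)$, can be verified in a few lines. The following is only a worked restatement; the shorthand $d$ is introduced here for convenience and does not appear in the original argument.

```latex
% Shorthand introduced for this check only: d = \|\hat\beta - \beta_0\|.
\begin{align*}
s = \Bigl(1 + \frac{d}{N}\Bigr)^{-1} = \frac{N}{N+d}
  &\implies \|\boldsymbol{\beta}_s - \boldsymbol{\beta}_0\| = s\,d = \frac{Nd}{N+d} \le N, \\
\frac{Nd}{N+d} \ge \frac{N}{2}
  &\iff 2d \ge N + d \iff d \ge N, \\
N = \frac{4a_n(1+t)}{V_n}
  &\implies \frac{V_n N^2}{4} = N \cdot \frac{V_n N}{4} = N a_n (1+t).
\end{align*}
```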

Proof of Theorem 2. First of all, the target function $El(\beta_0 + \beta_j X_j, Y)$ is a convex function in $\beta_j$. We first show that if $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0$, then $\beta_j^M$ must be zero. Recall $EX_j = 0$. The score equation of the marginal regression at $\boldsymbol{\beta}_j^M$ takes the form:

$$E\{b'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M)X_j\} = E(YX_j) = E\{b'(\mathbf{X}^T\boldsymbol{\beta}^*)X_j\}.$$

It can be equivalently written as

$$\mathrm{cov}(b'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M), X_j) = \mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0.$$

Since both functions $f(t) = b'(\beta_{j,0}^M + t)$ and $h(t) = t$ are strictly monotone in $t$, when $t \neq 0$,

$$\{f(t) - f(0)\}(t - 0) > 0.$$

If $\beta_j^M \neq 0$, let $t = \beta_j^M X_j$,

$$\beta_j^M \mathrm{cov}(f(\beta_j^M X_j), X_j) = E[E\{f(t) - f(0)\}(t - 0) \mid X_j \neq 0] > 0,$$

which leads to a contradiction. Hence $\beta_j^M$ must be zero.

On the other side, if $\beta_j^M = 0$, the score equations now take the form:

$$E\{b'(\beta_{j,0}^M)\} = E\{b'(\mathbf{X}^T\boldsymbol{\beta}^*)\}, \quad\text{and} \tag{11}$$

$$E\{b'(\beta_{j,0}^M)X_j\} = E\{b'(\mathbf{X}^T\boldsymbol{\beta}^*)X_j\}. \tag{12}$$

Since $b'(\beta_{j,0}^M)$ is a constant, we can get the desired result by plugging (11) into (12). □



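Proof of Theorem 3. We first prove the case that $b''(\theta)$ is bounded. By the Lipschitz continuity of the function $b'(\cdot)$, we have

$$\bigl|\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}X_j\bigr| \le D_1|\beta_j^M|X_j^2,$$

where $D_1 = \sup_x b''(x)$. By taking the expectation on both sides and using $EX_j^2 = 1$, we have

$$\bigl|E\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}X_j\bigr| \le D_1|\beta_j^M|,$$

namely,

$$D_1|\beta_j^M| \ge \bigl|\mathrm{cov}(b'(\beta_{j,0}^M + X_j\beta_j^M), X_j)\bigr|. \tag{13}$$

Note that $\beta_{j,0}^M$ and $\beta_j^M$ satisfy the score equation

$$E\{b'(\beta_{j,0}^M + \beta_j^M X_j) - b'(\mathbf{X}^T\boldsymbol{\beta}^*)\}X_j = 0. \tag{14}$$

It follows from (13) and $EX_j = 0$ that

$$|\beta_j^M| \ge D_1^{-1}c_1 n^{-\kappa}.$$

The conclusion follows.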

We now prove the second case. The result holds trivially if $|\beta_j^M| \ge cn^{-\kappa}$ for a sufficiently large universal constant $c$. Now suppose that $|\beta_j^M| \le c_9 n^{-\kappa}$, for some positive constant $c_9$. We will show later that $|\beta_{j,0}^M - \beta_0^M| \le c_{10}$ for some $c_{10} > 0$, where $\beta_0^M$ is such that $b'(\beta_0^M) = EY$. In this case, if $|X_j| \le n^{\eta}$, then the points $\beta_{j,0}^M$ and $(\beta_{j,0}^M + X_j\beta_j^M)$ fall in the interval $\beta_0^M \pm h$, independent of $j$, where $h = c_9 + c_{10}$.

By the Lipschitz continuity of the function $b'(\cdot)$ in the neighborhood around $\beta_0^M$, we have for $|X_j| \le n^{\eta}$,

$$\bigl|\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}X_j\bigr| \le D_2|\beta_j^M|X_j^2,$$

where $D_2 = \max_{x\in[\beta_0^M - h,\ \beta_0^M + h]} b''(x)$. By taking the expectation on both sides, it follows that

$$\bigl|E\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}X_j I(|X_j| \le n^{\eta})\bigr| \le D_2|\beta_j^M|. \tag{15}$$

By using (14) and $EX_j = 0$, we deduce from (15) that

$$D_2|\beta_j^M| \ge \bigl|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)\bigr| - A_0 - A_1, \tag{16}$$

where $A_m = E|b'(\beta_{j,0}^M + X_j^m\beta_j^M)X_j|\, I(|X_j| > n^{\eta})$ for $m = 0$ and $1$. Since $|\beta_{j,0}^M + X_j^m\beta_j^M| \le a|X_j|$ for $|X_j| > n^{\eta}$ for a sufficiently large $n$, independent of $j$, by the condition given in the theorem, we have

$$A_m \le EG(a|X_j|)^m |X_j|\, I(|X_j| > n^{\eta}) \le dn^{-\kappa}, \quad\text{for } m = 0 \text{ and } 1.$$

The conclusion now follows from (16).

It remains to show that when $|\beta_j^M| \le c_9 n^{-\kappa}$ we have $|\beta_{j,0}^M - \beta_0^M| \le c_{10}$. To this end, let

$$\ell(\beta_0) = E\{b(\beta_0 + \beta_j^M X_j) - Y(\beta_0 + \beta_j^M X_j)\}.$$

Then, it is easy to see that

$$\ell'(\beta_0) = Eb'(\beta_0 + \beta_j^M X_j) - b'(\beta_0^M).$$

Observe that

$$|Eb'(\beta_0 + \beta_j^M X_j) - b'(\beta_0)| \le R_1 + R_2, \tag{17}$$

where $R_1 = \sup_{|x|\le c_9 n^{\eta-\kappa}} |b'(\beta_0 + x) - b'(\beta_0)|$ and $R_2 = 2EG(a|X_j|)I(|X_j| > n^{\eta})$. Now, $R_1 = o(1)$ due to the continuity of $b'(\cdot)$ and $R_2 = o(1)$ by the condition of the theorem. Consequently, by (17), we conclude that

$$\ell'(\beta_0) = b'(\beta_0) - b'(\beta_0^M) + o(1).$$

Since $b'(\cdot)$ is a strictly increasing function, it is now obvious that

$$\ell'(\beta_0^M - c_{10}) < 0, \qquad \ell'(\beta_0^M + c_{10}) > 0,$$

for any given $c_{10} > 0$. Hence, $|\beta_{j,0}^M - \beta_0^M| < c_{10}$. □

Proof of Proposition 1. Without loss of generality, assume that $f(\cdot)$ is strictly increasing and $\rho > 0$. Since $X$ and $Z$ are jointly normally distributed, $Z$ can be expressed as

$$Z = \rho X + \varepsilon,$$

where $\rho = E(XZ)$ is the regression coefficient and $X$ and $\varepsilon$ are independent. Thus,

$$Ef(Z)X = Eg(\rho X)X = E[g(\rho X) - g(0)]X, \tag{18}$$

where $g(x) = Ef(x + \varepsilon)$ is a strictly increasing function. The right-hand side of (18) is always nonnegative and is zero if and only if $\rho = 0$.

To prove the second part, we first note that the random variable on the right-hand side of (18) is nonnegative. Thus, by the mean-value theorem, we have that

$$|Ef(Z)X| \ge \inf_{|x|\le c\rho} |g'(x)|\, \rho\, EX^2 I(|X| \le c).$$

Hence, the result follows. □

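Proof of Lemma 1. By Chebyshev's inequality,

$$P(Y \ge u) \le \exp(-s_0 u)\, E\exp(s_0 Y).$$

Let $\theta = \mathbf{X}^T\boldsymbol{\beta}^*$. Since $Y$ belongs to an exponential family, we have

$$E\{\exp(s_0 Y) \mid \theta\} = \exp(b(\theta + s_0) - b(\theta)).$$

Hence

$$P(Y \ge u) \le \exp(-s_0 u)\, E\exp(b(\mathbf{X}^T\boldsymbol{\beta}^* + s_0) - b(\mathbf{X}^T\boldsymbol{\beta}^*)).$$

Similarly we can get

$$P(Y \le -u) \le \exp(-s_0 u)\, E\exp(b(\mathbf{X}^T\boldsymbol{\beta}^* - s_0) - b(\mathbf{X}^T\boldsymbol{\beta}^*)).$$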
The desired result thus follows from Condition D by letting $u = m_0 t^{\alpha}/s_0$. □

Proof of Theorem 4. Note that Condition B is satisfied with $k_n$ defined in Section 5.2. The tail part of Condition B can also be easily checked. In fact,

$$E[l(\mathbf{X}_j^T\boldsymbol{\beta}_j, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)](1 - I_n(X_j, Y))$$
$$\le |Eb(\mathbf{X}_j^T\boldsymbol{\beta}_j)I(|X_j| > K_n)| + |Eb(\mathbf{X}_j^T\boldsymbol{\beta}_j^M)I(|X_j| > K_n)| + B(\boldsymbol{\beta}_j) + B(\boldsymbol{\beta}_j^M),$$

where $B(\boldsymbol{\beta}_j) = |EY\mathbf{X}_j^T\boldsymbol{\beta}_j(1 - I_n(X_j, Y))|$. The first two terms are of order $o(1/n)$ by assumption and the last two terms can be bounded by the exponential tail conditions in Condition D and the Cauchy-Schwarz inequality. By Theorem 1, we have for any $t > 0$,

$$P(\sqrt{n}\,|\hat{\beta}_j^M - \beta_j^M| \ge 16(1+t)k_n/V) \le \exp(-2t^2/K_n^2) + nm_1\exp(-m_0K_n^{\alpha}).$$

By taking $1 + t = c_3 V n^{1/2-\kappa}/(16k_n)$, it follows that

$$P(|\hat{\beta}_j^M - \beta_j^M| \ge c_3 n^{-\kappa}) \le \exp(-c_4 n^{1-2\kappa}/(k_nK_n)^2) + nm_1\exp(-m_0K_n^{\alpha}).$$

The first result follows from the union bound of probability.

To prove the second part, note that on the event

$$A_n = \Bigl\{\max_{j\in\mathcal{M}_*} |\hat{\beta}_j^M - \beta_j^M| \le c_2 n^{-\kappa}/2\Bigr\},$$

by Theorem 3, we have

$$|\hat{\beta}_j^M| \ge c_2 n^{-\kappa}/2, \quad\text{for all } j \in \mathcal{M}_*. \tag{19}$$

Hence, by the choice of $\gamma_n$, we have $\mathcal{M}_* \subset \widehat{\mathcal{M}}_{\gamma_n}$. The result now follows from a simple union bound:

$$P(A_n^c) \le s_n\{\exp(-c_4 n^{1-2\kappa}/(k_nK_n)^2) + nm_1\exp(-m_0K_n^{\alpha})\}.$$

This completes the proof. □

Proof of Theorem 5. The key idea of the proof is to show that

$$\|\boldsymbol{\beta}^M\|^2 = O(\|\Sigma\boldsymbol{\beta}^*\|^2) = O\{\lambda_{\max}(\Sigma)\}. \tag{20}$$

If so, the number of $\{j : |\beta_j^M| > \varepsilon n^{-\kappa}\}$ cannot exceed $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$ for any $\varepsilon > 0$. Thus, on the set

$$B_n = \Bigl\{\max_{1\le j\le p_n} |\hat{\beta}_j^M - \beta_j^M| \le \varepsilon n^{-\kappa}\Bigr\},$$

the number of $\{j : |\hat{\beta}_j^M| > 2\varepsilon n^{-\kappa}\}$ cannot exceed the number of $\{j : |\beta_j^M| > \varepsilon n^{-\kappa}\}$, which is bounded by $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$. By taking $\varepsilon = c_5/2$, we have

$$P[|\widehat{\mathcal{M}}_{\gamma_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}] \ge P(B_n).$$

The conclusion follows from Theorem 4(i).

It remains to prove (20). We first bound $\beta_j^M$. Since $b'(\cdot)$ is monotonically increasing, the function

$$\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}X_j\beta_j^M$$

is always positive. By Taylor's expansion, we have

$$\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}\beta_j^M X_j \ge D_3(\beta_j^M X_j)^2 I(|X_j| \le K),$$

where $D_3 = \inf_{|x|\le K(B+1)} b''(x)$, since $(\beta_{j,0}^M, \beta_j^M)$ is an interior point of the square $\mathcal{B}$ with length $2B$. By taking the expectation on both sides and using $EX_j = 0$, we have

$$Eb'(\beta_{j,0}^M + X_j\beta_j^M)\beta_j^M X_j \ge D_3 E(\beta_j^M X_j)^2 I(|X_j| \le K).$$

Since $EX_j^2 I(|X_j| \le K) = 1 - EX_j^2 I(|X_j| > K)$, it is uniformly bounded from below for a sufficiently large $K$, due to the uniform exponential tail bound in Condition D. Thus, it follows from (14) that

$$|\beta_j^M| \le D_4|Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)X_j|, \tag{21}$$

for some $D_4 > 0$.
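We now further bound from above the right-hand side of (21) by using $\mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*) = O(1)$. We first show the case where $b''(\cdot)$ is bounded. By the Lipschitz continuity of the function $b'(\cdot)$, we have

$$\bigl|\{b'(\mathbf{X}^T\boldsymbol{\beta}^*) - b'(\beta_0^*)\}X_j\bigr| \le D_1|X_j\mathbf{X}_M^T\boldsymbol{\beta}_1^*|,$$

where $\mathbf{X}_M = (X_1, \ldots, X_{p_n})^T$ and $\boldsymbol{\beta}_1^* = (\beta_1^*, \ldots, \beta_{p_n}^*)^T$.

By putting the above equation into the vector form and taking the expectation on both sides, we have

$$\|E\{b'(\mathbf{X}^T\boldsymbol{\beta}^*) - b'(\beta_0^*)\}\mathbf{X}_M\|^2 \le D_1^2\|E\mathbf{X}_M\mathbf{X}_M^T\boldsymbol{\beta}_1^*\|^2 \le D_1^2\lambda_{\max}(\Sigma)\|\Sigma^{1/2}\boldsymbol{\beta}_1^*\|^2. \tag{22}$$

Using $E\mathbf{X}_M = 0$ and $\mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*) = O(1)$, we conclude that

$$\|Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)\mathbf{X}_M\|^2 \le D_5\lambda_{\max}(\Sigma),$$

for some positive constant $D_5$. This together with (21) entails (20).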

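It remains to bound (21) for the second case. Since $\mathbf{X}_M = \Sigma_1^{1/2}\mathbf{U}$, it follows that

$$Eb'(\beta_0^* + \mathbf{X}_M^T\boldsymbol{\beta}_1^*)\mathbf{X}_M = Eb'(\beta_0^* + \boldsymbol{\beta}_2^T\mathbf{U})\Sigma_1^{1/2}\mathbf{U},$$

where $\boldsymbol{\beta}_2 = \Sigma_1^{1/2}\boldsymbol{\beta}_1^*$. By conditioning on $\boldsymbol{\beta}_2^T\mathbf{U}$, it can be computed that

$$E(\mathbf{U} \mid \boldsymbol{\beta}_2^T\mathbf{U}) = \boldsymbol{\beta}_2^T\mathbf{U}\,\boldsymbol{\beta}_2/\|\boldsymbol{\beta}_2\|^2.$$

Therefore,

$$Eb'(\beta_0^* + \mathbf{X}_M^T\boldsymbol{\beta}_1^*)\mathbf{X}_M = Eb'(\beta_0^* + \boldsymbol{\beta}_2^T\mathbf{U})\Sigma_1^{1/2}\boldsymbol{\beta}_2^T\mathbf{U}\,\boldsymbol{\beta}_2/\|\boldsymbol{\beta}_2\|^2 = Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)(\mathbf{X}^T\boldsymbol{\beta}^* - \beta_0^*)\Sigma_1^{1/2}\boldsymbol{\beta}_2/\|\boldsymbol{\beta}_2\|^2.$$

This entails that

$$\|Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)\mathbf{X}_M\|^2 = |Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)(\mathbf{X}^T\boldsymbol{\beta}^* - \beta_0^*)|^2\, \|\Sigma_1^{1/2}\boldsymbol{\beta}_2\|^2/\|\boldsymbol{\beta}_2\|^4. \tag{23}$$

By Condition G, $|Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)(\mathbf{X}^T\boldsymbol{\beta}^* - \beta_0^*)| = O(1)$. We also observe the facts that $\|\Sigma_1^{1/2}\boldsymbol{\beta}_2\| \le \lambda_{\max}^{1/2}(\Sigma)\|\boldsymbol{\beta}_2\|$ and that $\|\boldsymbol{\beta}_2\| = \|\Sigma_1^{1/2}\boldsymbol{\beta}_1^*\|$ is bounded. This proves (20) for the second case by using (21) and completes the proof. □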
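The counting step at the start of this proof, which passes from the norm bound (20) to a bound on the number of large marginal coefficients, is a Markov-type argument. The lines below spell it out as a worked sketch; nothing beyond (20), the triangle inequality and the definition of $B_n$ is assumed.

```latex
\begin{align*}
\#\{j : |\beta_j^M| > \varepsilon n^{-\kappa}\}\,\varepsilon^2 n^{-2\kappa}
  &\le \sum_{j=1}^{p_n} |\beta_j^M|^2 = \|\boldsymbol{\beta}^M\|^2, \\
\#\{j : |\beta_j^M| > \varepsilon n^{-\kappa}\}
  &\le \varepsilon^{-2} n^{2\kappa}\,\|\boldsymbol{\beta}^M\|^2
   = O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}, \\
\text{on } B_n:\quad |\hat{\beta}_j^M| > 2\varepsilon n^{-\kappa}
  &\implies |\beta_j^M| \ge |\hat{\beta}_j^M| - |\hat{\beta}_j^M - \beta_j^M| > \varepsilon n^{-\kappa}.
\end{align*}
```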

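Proof of Theorem 6. If $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0$, by Theorem 2, we have $\beta_j^M = 0$, hence by the model identifiability at $\beta_0^M$, $\beta_{j,0}^M = \beta_0^M$. Hence, $L_j^* = 0$. On the other hand, if $L_j^* = 0$, by Condition C', it follows that $\boldsymbol{\beta}_j^M = \boldsymbol{\beta}_0^M$, that is, $\beta_{j,0}^M = \beta_0^M$ and $\beta_j^M = 0$. Hence by Theorem 2, $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0$. □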

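Proof of Theorem 7. If $|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)| \ge c_1 n^{-\kappa}$, for $j \in \mathcal{M}_*$, by Theorem 3, we have $\min_{j\in\mathcal{M}_*} |\beta_j^M| \ge c_2 n^{-\kappa}$. The first result thus follows from Condition C.

To prove the second result, we will bound $L_j^*$. We first show the case where $b''(\cdot)$ is bounded. By definition, we have

$$0 \le L_j^* \le E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)\}. \tag{24}$$

By Taylor's expansion of the right-hand side of (24), we have that

$$E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)\} \le D_6(\beta_j^M)^2, \quad\text{for some } D_6 > 0. \tag{25}$$

The desired result thus follows from (24), (25) and the proof in Theorem 5 that

$$\|\mathbf{L}^*\| \le O(\|\boldsymbol{\beta}^M\|^2) = O(\lambda_{\max}(\Sigma)).$$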
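Now we prove the second case. By the mean-value theorem,

$$E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)\} = E\{Y - b'(\beta_{j,0}^M + sX_j\beta_j^M)\}X_j\beta_j^M, \tag{26}$$

for some $0 < s < 1$. Since $EYX_j = Eb'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M)X_j$, the last term is equal to

$$E\{b'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M) - b'(\beta_{j,0}^M + sX_j\beta_j^M)\}X_j\beta_j^M. \tag{27}$$

By the monotonicity of $b'(\cdot)$, when $X_j\beta_j^M \ge 0$, both factors in (27) are nonnegative, and hence

$$\{b'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M) - b'(\beta_{j,0}^M + sX_j\beta_j^M)\}X_j\beta_j^M \le \{b'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M) - b'(\beta_{j,0}^M)\}X_j\beta_j^M. \tag{28}$$

When $X_j\beta_j^M < 0$, both factors in (28) are negative and (28) continues to hold. It follows from (26)-(28) and $EX_j = 0$ that the right-hand side of (26) is bounded by

$$Eb'(\mathbf{X}_j^T\boldsymbol{\beta}_j^M)X_j\beta_j^M = Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)X_j\beta_j^M. \tag{29}$$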



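Combining (24), (26) and (29), we can bound $\|\mathbf{L}^*\|$ in the vector form by the Cauchy-Schwarz inequality:

$$\|\mathbf{L}^*\| \le \|Eb'(\mathbf{X}^T\boldsymbol{\beta}^*)\mathbf{X}_M\|\,\|\boldsymbol{\beta}^M\| = O(\|\Sigma_1^{1/2}\boldsymbol{\beta}_2\|\,\|\boldsymbol{\beta}^M\|),$$

where (23) is used in the last equality. The desired result thus follows from Theorem 5. □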



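Proof of Theorem 8. To prove the result, we first bound $L_{j,n}$ from below to show the strength of the signals. Let $\hat{\boldsymbol{\beta}}_0^M = (\hat{\beta}_0^M, 0)^T$. Then, by Taylor's expansion, we have

$$2L_{j,n} = (\hat{\boldsymbol{\beta}}_0^M - \hat{\boldsymbol{\beta}}_j^M)^T \ell_j''(\boldsymbol{\xi}_n)(\hat{\boldsymbol{\beta}}_0^M - \hat{\boldsymbol{\beta}}_j^M) \ge \lambda_{j,\min}(\hat{\beta}_j^M)^2, \tag{30}$$

where $\lambda_{j,\min}$ is the minimum eigenvalue of the Hessian matrix

$$\ell_j''(\boldsymbol{\xi}_n) = P_n b''(\boldsymbol{\xi}_n^T\mathbf{X}_j)\mathbf{X}_j\mathbf{X}_j^T,$$

where $\boldsymbol{\xi}_n$ lies between $\hat{\boldsymbol{\beta}}_0^M$ and $\hat{\boldsymbol{\beta}}_j^M$. We will show

$$P\{\lambda_{j,\min} > c_{11}\} = 1 - O\{\exp(-c_{12}n^{1-\kappa})\}, \tag{31}$$

for some $c_{11} > 0$ and $c_{12} > 0$.

Suppose (31) holds. Then, by (19), we have

$$P\Bigl\{\min_{j\in\mathcal{M}_*} |\hat{\beta}_j^M| \ge c_2 n^{-\kappa}/2\Bigr\} = 1 - O\bigl(s_n\{\exp(-c_4 n^{1-2\kappa}/(k_nK_n)^2) + nm_1\exp(-m_0K_n^{\alpha})\}\bigr).$$

This, together with (30) and (31), implies

$$P\Bigl\{\min_{j\in\mathcal{M}_*} L_{j,n} \ge c_{11}c_2^2 n^{-2\kappa}/8\Bigr\} = 1 - O\bigl(s_n\{\exp(-c_4 n^{1-2\kappa}/(k_nK_n)^2) + nm_1\exp(-m_0K_n^{\alpha})\}\bigr).$$

Hence, by choosing the thresholding $\nu_n = c_7 n^{-2\kappa}$, for $c_7 < c_{11}c_2^2/8$, $\mathcal{M}_* \subset \widehat{\mathcal{N}}_{\nu_n}$ with the probability tending to one exponentially fast, and the result follows.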

We now prove (31). It is obvious that

$$\ell_j''(\boldsymbol{\xi}_n) \ge \min_{|x|\le (B+1)K} b''(x)\, P_n\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K),$$

for any given $K$. Since the random variable involved is uniformly bounded in $j$, it follows from the Hoeffding inequality (Hoeffding, 1963) that

$$P\{|(P_n - P)X_j^k I(|X_j| \le K)| > \varepsilon\} \le \exp(-2n\varepsilon^2/(4K^{2k})), \tag{32}$$

for any $k \ge 0$ and $\varepsilon > 0$. By taking $\varepsilon = n^{-\kappa/2}$, we have

$$P\{|(P_n - P)X_j^k I(|X_j| \le K)| > n^{-\kappa/2}\} \le \exp(-2n^{1-\kappa}/(4K^{2k})).$$

Consequently, with probability tending to one exponentially fast, we have

$$\ell_j''(\boldsymbol{\xi}_n) \ge \min_{|x|\le (B+1)K} b''(x)\, E\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K)/2. \tag{33}$$

The minimum eigenvalue of the matrix $E\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K)$ is

$$\min_{|a|\le 1} E\bigl(a^2 + 2a\sqrt{1-a^2}\,X_j + (1-a^2)X_j^2\bigr)I(|X_j| \le K).$$

It is bounded from below by

$$\min_{|a|\le 1} E\{a^2 + (1-a^2)X_j^2 I(|X_j| \le K)\} - 2|EX_j I(|X_j| \le K)| - K^{-2}, \tag{34}$$

where we used $P(|X_j| > K) \le K^{-2}$. Since $EX_j = 0$ and $EX_j^2 = 1$,

$$|EX_j I(|X_j| \le K)| = |EX_j I(|X_j| > K)| \le K^{-1}EX_j^2 I(|X_j| > K) \le K^{-1}.$$

Hence the quantity in (34) can be further bounded from below by

$$EX_j^2 I(|X_j| \le K) + \min_{|a|\le 1} a^2 EX_j^2 I(|X_j| > K) - 2K^{-1} - K^{-2} \ge 1 - \sup_j EX_j^2 I(|X_j| > K) - 2K^{-1} - K^{-2}.$$

The result follows from Condition G and (33). □
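To give a feel for how conservative the Hoeffding-type bound (32) is, the following Python sketch compares its right-hand side with a Monte Carlo frequency for standard normal covariates. This is an illustration only: the function names, the normal design, the replication counts and the use of a large independent sample to approximate the population mean are all assumptions of the sketch, not part of the proof.

```python
import numpy as np


def hoeffding_bound(n, eps, K, k):
    """Right-hand side of (32): exp(-2 n eps^2 / (4 K^(2k)))."""
    return np.exp(-2.0 * n * eps**2 / (4.0 * K ** (2 * k)))


def empirical_tail(n, eps, K, k, reps=2000, seed=0):
    """Monte Carlo frequency of |(P_n - P) X^k I(|X| <= K)| > eps.

    X is drawn as N(0, 1) here, which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(reps, n))
    truncated = np.where(np.abs(x) <= K, x**k, 0.0)
    # Approximate the population mean P X^k I(|X| <= K) by a large sample.
    big = rng.normal(size=1_000_000)
    pop_mean = np.where(np.abs(big) <= K, big**k, 0.0).mean()
    deviations = np.abs(truncated.mean(axis=1) - pop_mean)
    return float(np.mean(deviations > eps))


n, eps, K, k = 200, 0.3, 2.0, 2
bound = hoeffding_bound(n, eps, K, k)
freq = empirical_tail(n, eps, K, k)
print(f"bound from (32): {bound:.3f}, simulated frequency: {freq:.4f}")
```

In settings like this one the simulated frequency is typically far below the bound, which is consistent with the proof needing only the exponential rate and not a sharp constant.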

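Proof of Theorem 9. By (30), it can be easily seen that

$$2L_{j,n} \le D_1\lambda_{\max}(P_n\mathbf{X}_j\mathbf{X}_j^T)\,\|\hat{\boldsymbol{\beta}}_0^M - \hat{\boldsymbol{\beta}}_j^M\|^2,$$

where $D_1 = \sup_x b''(x)$ as defined in the proof of Theorem 3. By (32), with the exception of a set with negligible probability, it follows that

$$\lambda_{\max}(P_n\mathbf{X}_j\mathbf{X}_j^T) \le 2\lambda_{\max}(E\mathbf{X}_j\mathbf{X}_j^T) = 2,$$

uniformly in $j$. Therefore, with probability tending to one exponentially fast, we have

$$L_{j,n} \le D_7\{(\hat{\beta}_0^M - \hat{\beta}_{j,0}^M)^2 + (\hat{\beta}_j^M)^2\}, \tag{35}$$

for some $D_7 > 0$.

We now use (35) to show that if $L_{j,n} \ge c_7 n^{-2\kappa}$, then $|\hat{\beta}_j^M| \ge D_8 n^{-\kappa}$, with exception on a set with negligible probability, where $D_8 = \{c_7/(2D_7)\}^{1/2}$. This implies that

$$\widehat{\mathcal{N}}_{\nu_n} \subset \widehat{\mathcal{M}}_{\gamma_n},$$

with $\gamma_n = D_8 n^{-\kappa}$. The conclusion then follows from Theorem 5.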

We now show that $L_{j,n} \ge c_7 n^{-2\kappa}$ implies that $|\hat{\beta}_j^M| \ge D_8 n^{-\kappa}$, with exception on a set with negligible probability. Suppose that $|\hat{\beta}_j^M| < D_8 n^{-\kappa}$. From the likelihood equations, we have

$$b'(\hat{\beta}_0^M) = \bar{Y} = P_n b'(\hat{\beta}_{j,0}^M + \hat{\beta}_j^M X_j). \tag{36}$$

From the proof of Theorem 4, with exception on a set with negligible probability, we have $|\hat{\beta}_j^M - \beta_j^M| \le c_{13}n^{-\kappa}$ and $|\hat{\beta}_{j,0}^M - \beta_{j,0}^M| \le c_{14}n^{-\kappa}$, for some constants $c_{13}$ and $c_{14}$. Since $(\beta_0^M, 0)$ and $(\beta_{j,0}^M, \beta_j^M)$ are interior points of the square $\mathcal{B}$ with length $2B$, it follows that with exception on a set with negligible probability, $|\hat{\beta}_0^M| \le B$ and $|\hat{\beta}_{j,0}^M| \le B$. Recall $D_1 = \sup_x b''(x)$. By Taylor expansion, for some $0 < s < 1$, we have

$$|P_n b'(\hat{\beta}_{j,0}^M + \hat{\beta}_j^M X_j) - b'(\hat{\beta}_{j,0}^M)| = |b''(\hat{\beta}_{j,0}^M + s\hat{\beta}_j^M X_j)\hat{\beta}_j^M P_n X_j| \le D_1|\hat{\beta}_j^M P_n X_j| = o_P(|\hat{\beta}_j^M|), \tag{37}$$

where the last step follows from the facts that $EX_j = 0$ and consequently $P_n X_j = o(1)$ with an exception on a set of negligible probability, by applying the Hoeffding inequality on $\Omega_n$ and considering the exponential tail property of $X_j$. Hence, by (36) and (37), we have

$$|b'(\hat{\beta}_0^M) - b'(\hat{\beta}_{j,0}^M)| = o_P(|\hat{\beta}_j^M|).$$

Let $D_9 = \inf_{|x|\le 2B} b''(x)$. With exception on a set with negligible probability, we have

$$|b'(\hat{\beta}_0^M) - b'(\hat{\beta}_{j,0}^M)| \ge D_9|\hat{\beta}_0^M - \hat{\beta}_{j,0}^M|.$$

Therefore, we conclude that

$$|\hat{\beta}_0^M - \hat{\beta}_{j,0}^M| = o_P(|\hat{\beta}_j^M|).$$

By (35), we have $|\hat{\beta}_j^M| \ge D_8 n^{-\kappa}$. This is a contradiction, except on a set
that has a negligible probability. This completes the proof. □

SURE INDEPENDENCE SCREENING IN GENERALIZED LINEAR MODELS WITH NP-DIMENSIONALITY*

By Jianqing Fan and Rui Song

Ultrahigh dimensional variable selection plays an increasingly important role in contemporary scientific discoveries and statistical research. Among others, Fan and Lv (2008) propose an independent screening framework by ranking the marginal correlations. They showed that the correlation ranking procedure possesses a sure independence screening property within the context of the linear model with Gaussian covariates and responses. In this paper, we propose a more general version of the independent learning with ranking the maximum marginal likelihood estimates or the maximum marginal likelihood itself in generalized linear models. We show that the proposed methods, with Fan and Lv (2008) as a very special case, also possess the sure screening property with vanishing false selection rate. The conditions under which the independence learning possesses the sure screening property are surprisingly simple. This justifies the applicability of such a simple method in a wide spectrum. We quantify explicitly the extent to which the dimensionality can be reduced by independence screening, which depends on the interactions of the covariance matrix of covariates and true parameters. Simulation studies are used to illustrate the utility of the proposed approaches. In addition, we establish an exponential inequality for the quasi-maximum likelihood estimator which is useful for high-dimensional statistical learning.



1. Introduction. The ultrahigh dimensional regression problem is a significant feature in many areas of modern scientific research using quantitative measurements such as microarrays, genomics, proteomics, brain images and genetic data. For example, in studying the associations between phenotypes such as height and cholesterol level and genotypes, it can involve millions of SNPs; in disease classification using microarray data, it can use thousands of expression profiles; and dimensionality grows rapidly when interactions are considered. Such a demand from applications brings many challenges to statistical inference, as the dimension $p$ can grow much faster than the sample size $n$ such that many models are not even identifiable. By non-polynomial dimensionality or simply NP-dimensionality, we mean $\log p = O(n^a)$ for some $a > 0$. We will also loosely refer to it as an ultrahigh dimensionality. The phenomenon of noise accumulation in high-dimensional regression has also been observed by statisticians and computer scientists. See Fan and Lv (2008) and Fan and Fan (2008) for a comprehensive review and references therein. When dimension $p$ is ultrahigh, it is often assumed that only a small number of variables among predictors $X_1, \ldots, X_p$ contribute to the response, which leads to the sparsity of the parameter vector $\boldsymbol{\beta}$. As a consequence, variable selection plays a prominent role in high dimensional statistical modeling.

Many variable selection techniques for various high dimensional statistical models have been proposed. Most of them are based on the penalized pseudo-likelihood approach, to name a few, the bridge regression in Frank and Friedman (1993), the LASSO in Tibshirani (1996), the SCAD and other folded-concave penalty in Fan and Li (2001), the Dantzig selector in Candes and Tao (2007) and their related methods (Zou, 2006; Zou and Li, 2008). Theoretical studies of these methods concentrate on the persistency (Greenshtein and Ritov, 2004; van de Geer, 2008), consistency and oracle properties (Fan and Li, 2001; Zou, 2006). However, in ultrahigh dimensional statistical learning problems, these methods may not perform well due to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability (Fan et al., 2009).



Fan and Lv (2008) proposed a sure independent screening (SIS) method to select important variables in ultrahigh dimensional linear models. Their proposed two-stage procedure can deal with the aforementioned three challenges better than other methods. See also Huang et al. (2008) for a related study based on a marginal bridge regression. Fan and Lv (2008) showed that the correlation ranking of features possesses a sure independence screening (SIS) property under certain conditions, that is, with probability very close to 1, the independence screening technique retains all of the important variables in the model. However, the SIS procedure in Fan and Lv (2008) is restricted to the ordinary linear models, and their technical arguments depend heavily on the joint normality assumptions and cannot easily be extended even within the context of a linear model. This significantly limits its use in practice, since it excludes categorical variables. Huang et al. (2008) also investigate the marginal bridge regression in the ordinary linear model and their arguments also depend heavily on the explicit expressions of the least-square estimator and bridge regression. This calls for research on SIS procedures in more general models and under less restrictive assumptions.

In this paper, we consider an independence learning by ranking the maximum marginal likelihood estimator (MMLE) or maximum marginal likelihood itself for generalized linear models. That is, we fit $p$ marginal regressions by maximizing the marginal likelihood with response $Y$ and the marginal covariate $X_j$, $j = 1, \ldots, p$ (and the intercept) each time. The magnitude of the absolute values of the MMLE can preserve the non-sparsity information of the joint regression models, provided that the true values of the marginal likelihood preserve the non-sparsity of the joint regression models and that the MMLE estimates the true values of the marginal likelihood uniformly well. The former holds under a surprisingly simple condition, whereas the latter requires a development of uniform convergence over NP-dimensional marginal likelihoods. Hall et al. (2009) used a different marginal utility, derived from an empirical likelihood point of view. Hall and Miller (2009) proposed a generalized correlation ranking, which allows nonlinear regression. Both papers proposed an interesting bootstrap method to assess the authority of the selected features.

As the MMLE or maximum likelihood ranking is equivalent to the marginal correlation ranking in the ordinary linear models, our work can thus be considered as an important extension of SIS in Fan and Lv (2008), where the joint normality of the response and covariates is imposed. Moreover, our results improve over those in Fan and Lv (2008) in at least three aspects. Firstly, we establish a new framework for having SIS properties, which does not build on the normality assumption even in the linear model setting. Secondly, while it is not obvious (and could be hard) to generalize the proof of Fan and Lv (2008) to more complicated models, in the current framework, the SIS procedure can be applied to the generalized linear models and possibly other models. Thirdly, our results can easily be applied to the generalized correlation ranking (Hall and Miller, 2009) and other rankings
based on a group of marginal variables.

Fitting marginal models to a joint regression is a type of model misspecification (White, 1982), since we drop out most covariates from the model fitting. In this paper, we establish a nonasymptotic tail probability bound for the MMLE under model misspecifications, which is beyond the traditional asymptotic framework of model misspecification and of interest in its own right. As a practical screening method, independent screening can miss variables that are marginally weakly correlated with the response variables, but jointly highly important to the response variables, and also rank some jointly unimportant variables too high by using marginal methods. Fan and Lv (2008) and Fan et al. (2009) develop iteratively conditional screening and selection methods to make the procedures robust and practical. The former focuses on ordinary linear models and the latter improves the idea in the former and expands significantly the scope of applicability, including generalized linear models.

The SIS property can be achieved as long as the surrogate, in this case, 
the marginal utility, can preserve the non-sparsity of the true parameter 
values. With a similar idea, Fan et al. (2009) proposed a SIS procedure
for generalized linear models, by sorting the maximum likelihood functions, 
which is a type of "marginal likelihood ratio" ranking, whereas the MMLE 
can be viewed as a Wald type of statistic. The two methods are equivalent 
in terms of sure screening properties in our proposed framework. This will 
be demonstrated in our paper. The key technical challenge in the maximum 
marginal likelihood ranking is that the signal can even be weaker than the 
noise. We overcome this technical difficulty by using the invariance property 
of ranking under monotonic transforms. 

The rest of the paper is organized as follows. In Section 2, we briefly 
introduce the setups of the generalized linear models. The SIS procedure
is presented in Section 3. In Section 4, we provide an exponential bound
for the quasi-maximum likelihood estimator. The SIS properties of the MMLE
learning are presented in Section 5. In Section 6, we formulate the marginal 
likelihood screening and show the SIS property. Some simulation results are 
presented in Section 7. A summary of our findings and discussions is in 
Section 8. The detailed proofs are relegated to Section 9. 

2. Generalized Linear Models. Assume that the random scalar $Y$ is from an exponential family with the probability density function taking the canonical form

$$f_Y(y; \theta) = \exp\{y\theta - b(\theta) + c(y)\}, \tag{1}$$

for some known functions $b(\cdot)$, $c(\cdot)$ and unknown function $\theta$. Here we do not consider the dispersion parameter as we only model the mean regression. We can easily introduce a dispersion parameter in (1) and the results continue to hold. The function $\theta$ is usually called the canonical or natural parameter. The mean response is $b'(\theta)$, the first derivative of $b(\theta)$ with respect to $\theta$. We consider the problem of estimating a $(p+1)$-vector of parameter $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)$ from the following generalized linear model

$$E(Y \mid \mathbf{X} = \mathbf{x}) = b'(\theta(\mathbf{x})) = g^{-1}\Bigl(\sum_{j=0}^{p} \beta_j x_j\Bigr), \tag{2}$$

where $\mathbf{x} = \{x_0, x_1, \ldots, x_p\}^T$ is a $(p+1)$-dimensional covariate and $x_0 = 1$ represents the intercept. If $g$ is the canonical link, i.e., $g = (b')^{-1}$, then $\theta(\mathbf{x}) = \sum_{j=0}^{p} \beta_j x_j$. We focus on the canonical link function in this paper for simplicity of presentation.
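As a concrete reading of the canonical form (1), the short Python sketch below lists the cumulant function $b(\theta)$ and the mean function $b'(\theta)$ for three standard families (Gaussian with unit variance, Bernoulli and Poisson) and checks numerically that the stated $b'$ is indeed the derivative of $b$. The dictionary layout and the helper names are choices made for this illustration and are not notation from the paper.

```python
import math

# Cumulant function b(theta) and its derivative b'(theta) for three
# canonical exponential families, with the dispersion fixed at one.
FAMILIES = {
    "gaussian": (lambda t: 0.5 * t * t, lambda t: t),
    "bernoulli": (
        lambda t: math.log1p(math.exp(t)),
        lambda t: 1.0 / (1.0 + math.exp(-t)),
    ),
    "poisson": (lambda t: math.exp(t), lambda t: math.exp(t)),
}


def mean_response(family, theta):
    """Mean of Y at canonical parameter theta, i.e. b'(theta)."""
    return FAMILIES[family][1](theta)


def numeric_derivative(f, t, h=1e-6):
    """Central finite difference, used only to check b' against b."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


for name, (b, b_prime) in FAMILIES.items():
    gap = abs(numeric_derivative(b, 0.3) - b_prime(0.3))
    print(f"{name}: |numeric b' - stated b'| = {gap:.2e}")
```

With the canonical link, plugging the linear predictor into `mean_response` gives the regression function in (2); for the Bernoulli family this is the usual logistic regression.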

Assume that the observed data $\{(\mathbf{X}_i, Y_i), i = 1, \ldots, n\}$ are i.i.d. copies of $(\mathbf{X}, Y)$, where the covariate $\mathbf{X} = (X_0, X_1, \ldots, X_p)$ is a $(p+1)$-dimensional random vector and $X_0 = 1$. We allow $p$ to grow with $n$ and denote it as $p_n$ whenever needed.

We note that the ordinary linear model $Y = \mathbf{X}^T\boldsymbol{\beta} + \varepsilon$ with $\varepsilon \sim N(0, 1)$ is a special case of model (2), by taking $g(\mu) = \mu$ and $b(\theta) = \theta^2/2$. When the design matrix $\mathbf{X}$ is standardized, the ranking by the magnitude of the marginal correlation is in fact the same as the ranking by the magnitude of the maximum marginal likelihood estimator (MMLE). Next we propose an independence screening method to GLIM based on the MMLE. We also assume that the covariates are standardized to have mean zero and standard deviation one:

$$EX_j = 0, \quad\text{and}\quad EX_j^2 = 1, \qquad j = 1, \ldots, p_n.$$
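The remark that, for the ordinary linear model with standardized covariates, ranking by marginal correlation coincides with ranking by the MMLE can be checked on simulated data. The sketch below is an illustration under assumptions of its own (a Gaussian design, the particular coefficients, and sample standardization in place of the population standardization above); in the Gaussian case the MMLE slope is simply the marginal least-squares slope.

```python
import numpy as np


def standardize(X):
    """Center each column and scale it to unit (population) standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def marginal_ls_slopes(X, y):
    """Slope of the simple regression of y on each standardized column.

    With an intercept and a column of mean zero and mean square one,
    the least-squares slope reduces to mean(x * (y - mean(y))).
    """
    yc = y - y.mean()
    return X.T @ yc / X.shape[0]


def marginal_correlations(X, y):
    """Sample correlation between y and each column of X."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


rng = np.random.default_rng(0)
n, p = 200, 50
X = standardize(rng.normal(size=(n, p)))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)

slopes = marginal_ls_slopes(X, y)
corrs = marginal_correlations(X, y)
rank_by_slope = np.argsort(-np.abs(slopes))
rank_by_corr = np.argsort(-np.abs(corrs))
print(rank_by_slope[:5], rank_by_corr[:5])
```

The slopes equal the correlations multiplied by the common factor `y.std()`, so the two orderings agree.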

3. Independence Screening with MMLE. Let $\mathcal{M}_* = \{1 \le j \le p_n : \beta_j^* \neq 0\}$ be the true sparse model with non-sparsity size $s_n = |\mathcal{M}_*|$, where $\boldsymbol{\beta}^* = (\beta_0^*, \beta_1^*, \ldots, \beta_{p_n}^*)$ denotes the true value. In this paper, we refer to marginal models as fitting models with componentwise covariates. The maximum marginal likelihood estimator (MMLE) $\hat{\boldsymbol{\beta}}_j^M$, for $j = 1, \ldots, p_n$, is defined as the minimizer of the componentwise regression:

$$\hat{\boldsymbol{\beta}}_j^M = (\hat{\beta}_{j,0}^M, \hat{\beta}_j^M) = \mathrm{argmin}_{\beta_0, \beta_j} P_n l(\beta_0 + \beta_j X_j, Y),$$

where $l(Y; \theta) = -[\theta Y - b(\theta) - \log c(Y)]$ and $P_n f(X, Y) = n^{-1}\sum_{i=1}^{n} f(X_i, Y_i)$ is the empirical measure. This can be rapidly computed and its implementation is robust, avoiding numerical instability in NP-dimensional problems. We correspondingly define the population version of the minimizer of the componentwise regression,

$$\boldsymbol{\beta}_j^M = (\beta_{j,0}^M, \beta_j^M) = \mathrm{argmin}_{\beta_0, \beta_j} El(\beta_0 + \beta_j X_j, Y), \quad\text{for } j = 1, \ldots, p_n,$$

where $E$ denotes the expectation under the true model.

We select a set of variables

$$\widehat{\mathcal{M}}_{\gamma_n} = \{1 \le j \le p_n : |\hat{\beta}_j^M| \ge \gamma_n\}, \tag{3}$$

where $\gamma_n$ is a predefined threshold value. Such an independence learning ranks the importance of features according to their magnitude of marginal regression coefficients. With an independence learning, we dramatically decrease the dimension of the parameter space from $p_n$ (possibly hundreds of thousands) to a much smaller number by choosing a large $\gamma_n$, hence the computation is much more feasible. Although the interpretations and implications of the marginal models are biased from the joint model, the non-sparse information about the joint model can be passed along to the marginal model under a mild condition. Hence it is suitable for the purpose of variable screening. Next we will show under certain conditions that the sure screening property holds, i.e., the set $\mathcal{M}_*$ belongs to $\widehat{\mathcal{M}}_{\gamma_n}$ with probability one asymptotically, for an appropriate choice of $\gamma_n$. To accomplish this, we need the following technical device.
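The screening rule (3) can be sketched in a few lines for the logistic case. The code below is a minimal illustration, not the authors' implementation: the function names `marginal_logistic_mle` and `mmle_screen`, the plain Newton iteration with a fixed number of steps, the simulated design and the threshold value are all assumptions made for the example.

```python
import numpy as np


def marginal_logistic_mle(x, y, n_iter=25):
    """Fit P(Y = 1 | x) = expit(b0 + b1 * x) by Newton's method.

    Returns the pair (b0, b1), playing the role of the MMLE for one covariate.
    """
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (mu - y) / len(y)
        hess = (Z * (mu * (1.0 - mu))[:, None]).T @ Z / len(y)
        beta = beta - np.linalg.solve(hess, grad)
    return beta


def mmle_screen(X, y, gamma):
    """Return the indices j with |marginal slope| >= gamma, and all slopes."""
    slopes = np.array(
        [marginal_logistic_mle(X[:, j], y)[1] for j in range(X.shape[1])]
    )
    return np.flatnonzero(np.abs(slopes) >= gamma), slopes


rng = np.random.default_rng(1)
n, p = 1000, 200
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized covariates
beta_true = np.zeros(p)
beta_true[[0, 1, 2]] = [2.0, -2.0, 2.0]    # hypothetical sparse signal
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob).astype(float)

selected, slopes = mmle_screen(X, y, gamma=0.4)
print("selected covariates:", selected)
```

Each marginal fit involves only two parameters, so the cost grows linearly in $p_n$, which is what makes the rule attractive when $p_n$ is very large. The threshold `gamma=0.4` is arbitrary here; the theory in Section 5 ties $\gamma_n$ to $n^{-\kappa}$.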

4. An exponential bound for QMLE. In this section, we obtain an exponential bound for the quasi-MLE (QMLE), which will be used in the next section. Since this result holds under very general conditions and is of independent interest, in the following we make a more general description of the model and its conditions.

Consider data $\{\mathbf{X}_i, Y_i\}$, $i = 1, \ldots, n$, which are $n$ i.i.d. samples of $(\mathbf{X}, Y) \in \mathcal{X} \times \mathcal{Y}$ for some spaces $\mathcal{X}$ and $\mathcal{Y}$. A regression model for $\mathbf{X}$ and $Y$ is assumed with quasi-likelihood function $-l(\mathbf{X}^T\boldsymbol{\beta}, Y)$. Here $Y$ and $\mathbf{X} = (X_1, \ldots, X_q)^T$ represent the response and the $q$-dimensional covariate vector, which may include both discrete and continuous components and the dimensionality can also depend on $n$. Let

$$\boldsymbol{\beta}_0 = \mathrm{argmin}_{\boldsymbol{\beta}} El(\mathbf{X}^T\boldsymbol{\beta}, Y)$$

be the population parameter. Assume that $\boldsymbol{\beta}_0$ is an interior point of a sufficiently large, compact and convex set $\mathbf{B} \subset R^q$. The following conditions on the model are needed:


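A. The Fisher information

$$I(\boldsymbol{\beta}) = E\Bigl\{\Bigl[\frac{\partial}{\partial\boldsymbol{\beta}} l(\mathbf{X}^T\boldsymbol{\beta}, Y)\Bigr]\Bigl[\frac{\partial}{\partial\boldsymbol{\beta}} l(\mathbf{X}^T\boldsymbol{\beta}, Y)\Bigr]^T\Bigr\}$$

is finite and positive definite at $\boldsymbol{\beta} = \boldsymbol{\beta}_0$. Moreover, $\|I(\boldsymbol{\beta})\|_{\mathbf{B}} = \sup_{\boldsymbol{\beta}\in\mathbf{B}, \|\mathbf{x}\|=1} \|I(\boldsymbol{\beta})^{1/2}\mathbf{x}\|$ exists, where $\|\cdot\|$ is the Euclidean norm.

B. The function $l(\mathbf{x}^T\boldsymbol{\beta}, y)$ satisfies the Lipschitz property with positive constant $k_n$:

$$|l(\mathbf{x}^T\boldsymbol{\beta}, y) - l(\mathbf{x}^T\boldsymbol{\beta}', y)|\, I_n(\mathbf{x}, y) \le k_n|\mathbf{x}^T\boldsymbol{\beta} - \mathbf{x}^T\boldsymbol{\beta}'|\, I_n(\mathbf{x}, y),$$

for $\boldsymbol{\beta}, \boldsymbol{\beta}' \in \mathbf{B}$, where $I_n(\mathbf{x}, y) = I((\mathbf{x}, y) \in \Omega_n)$ with

$$\Omega_n = \{(\mathbf{x}, y) : \|\mathbf{x}\|_\infty \le K_n,\ |y| \le K_n^*\},$$

for some sufficiently large positive constants $K_n$ and $K_n^*$, and $\|\cdot\|_\infty$ being the supremum norm. In addition, there exists a sufficiently large constant $C$ such that with $b_n = Ck_nV_n^{-1}(q/n)^{1/2}$,

$$\sup_{\boldsymbol{\beta}\in\mathbf{B},\ \|\boldsymbol{\beta}-\boldsymbol{\beta}_0\|\le b_n} |E[l(\mathbf{X}^T\boldsymbol{\beta}, Y) - l(\mathbf{X}^T\boldsymbol{\beta}_0, Y)](1 - I_n(\mathbf{X}, Y))| \le o(q/n),$$

where $V_n$ is the constant given in Condition C.

C. The function $l(\mathbf{X}^T\boldsymbol{\beta}, Y)$ is convex in $\boldsymbol{\beta}$, satisfying

$$E(l(\mathbf{X}^T\boldsymbol{\beta}, Y) - l(\mathbf{X}^T\boldsymbol{\beta}_0, Y)) \ge V_n\|\boldsymbol{\beta} - \boldsymbol{\beta}_0\|^2,$$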
for all $\|\boldsymbol{\beta} - \boldsymbol{\beta}_0\| \le b_n$ and some positive constants $V_n$.

Condition A is analogous to assumption A6(b) of White (1982) and assumption $R_s$ in Fahrmeir and Kaufmann (1985). It ensures the identifiability and the existence of the QMLE and is satisfied for many examples of generalized linear models. Conditions A and C overlap but are not the same.

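We now establish an exponential bound for the tail probability of the QMLE:

$$\hat{\boldsymbol{\beta}} = \mathrm{argmin}_{\boldsymbol{\beta}} P_n l(\mathbf{X}^T\boldsymbol{\beta}, Y).$$

The idea of the proof is to connect $\sqrt{n}\,\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|$ to the tail of certain empirical processes and utilize the convexity and Lipschitz continuities.

Theorem 1. Under Conditions A-C, it holds that for any $t > 0$,

$$P\bigl(\sqrt{n}\,\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\| \ge 16k_n(1+t)/V_n\bigr) \le \exp(-2t^2/K_n^2) + nP(\Omega_n^c).$$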



5. Sure Screening Properties with MMLE. In this section, we introduce a new framework for establishing the sure screening property with MMLE in the canonical exponential family (1). We present our findings in three subsections.

5.1. Population Aspect. As fitting marginal regressions to a joint regres- 
sion is a type of model misspecification, an important question would be: 
at what level the model information is preserved. Specifically for screening 
purpose, we are interested in the preservation of the non-sparsity from the 
joint regression to the marginal regression. This can be summarized into the 
following two questions. First, for the sure screening purpose, if a variable 
Xj is jointly important (/3* 7^ 0), will (and under what conditions) it still 
be marginally important (/3j*^ 7^ 0)? Second, for the model selection consis- 
tency purpose, if a variable Xj is jointly unimportant (/3J = 0), will it still be 
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marginally unimportant {j3j^ = 0)? We aim to answer these two questions 
in this section. 

The following theorem reveals that the marginal regression parameter is 
in fact a measurement of the correlation between the marginal covariate and 
the mean response function. 

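Theorem 2. For $j = 1, \ldots, p_n$, the marginal regression parameters
$\beta_j^M = 0$ if and only if $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0$.

By using the fact that $\mathbf{X}^T\boldsymbol{\beta}^* = \beta_0^* + \sum_{j \in \mathcal{M}_\star} X_j \beta_j^*$, we can easily
show the following corollary.

Corollary 1. If the partial orthogonality condition holds, i.e., $\{X_j, j \notin \mathcal{M}_\star\}$
is independent of $\{X_i, i \in \mathcal{M}_\star\}$, then $\beta_j^M = 0$ for $j \notin \mathcal{M}_\star$.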

This partial orthogonality condition is essentially the assumption made
in Huang et al. (2008), who showed the model selection consistency in the
special case with the ordinary linear model and bridge regression. Note that
$\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = \mathrm{cov}(Y, X_j)$. A necessary condition for sure screening
is that the important variables $X_j$ with $\beta_j^* \ne 0$ are correlated with the
response, which usually holds. When they are correlated with the response,
by Theorem 2, $\beta_j^M \ne 0$ for $j \in \mathcal{M}_\star$. In other words, the marginal model
retains the information about the important variables in the joint model.
This is the theoretical basis for the sure independence screening. On the
other hand, if the partial orthogonality condition in Corollary 1 holds, then
$\beta_j^M = 0$ for $j \notin \mathcal{M}_\star$. In this case, there exists a threshold $\gamma_n$ such that the
marginally selected model is model selection consistent:

$$ \min_{j \in \mathcal{M}_\star} |\beta_j^M| \ge \gamma_n, \qquad \max_{j \notin \mathcal{M}_\star} |\beta_j^M| = 0. $$

To have a sure screening property based on the sample version (3), we
need

$$ \min_{j \in \mathcal{M}_\star} |\beta_j^M| \ge O(n^{-\kappa}), $$

for some $\kappa < 1/2$ so that the marginal signals are stronger than the stochastic
noise. The following theorem shows that this is possible.

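Theorem 3. If $|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)| \ge c_1 n^{-\kappa}$ for $j \in \mathcal{M}_\star$ and a positive
constant $c_1 > 0$, then there exists a positive constant $c_2$ such that

$$ \min_{j \in \mathcal{M}_\star} |\beta_j^M| \ge c_2 n^{-\kappa}, $$

provided that $b''(\cdot)$ is bounded or

$$ E\,G(a|X_j|)|X_j| I(|X_j| \ge n^{\eta}) \le d\,n^{-\kappa}, \quad \text{for some } 0 < \eta < \kappa, $$

and some sufficiently small positive constants $a$ and $d$, where
$G(|x|) = \sup_{|u| \le |x|} |b'(u)|$.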

Note that for the normal and Bernoulli distributions, $b''(\cdot)$ is bounded,
whereas for the Poisson distribution, $G(|x|) = \exp(|x|)$ and Theorem 3
requires the tails of $X_j$ to be light. Under some additional conditions, we will
show in the proof of Theorem 5 that

$$ \sum_{j=1}^{p_n} |\beta_j^M|^2 = O(\|\Sigma\boldsymbol{\beta}^*\|^2) = O(\lambda_{\max}(\Sigma)), $$

where $\Sigma = \mathrm{var}(\mathbf{X})$, and $\lambda_{\max}(\Sigma)$ is its maximum eigenvalue. The first
equality requires some effort to prove, whereas the second equality follows easily
from the assumption

$$ \mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*) = \boldsymbol{\beta}^{*T}\Sigma\boldsymbol{\beta}^* = O(1). $$

The implication of this result is that there cannot be too many variables
whose marginal coefficient $|\beta_j^M|$ exceeds a given thresholding level.
This yields sparsity in the final selected model.
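
The counting step behind this remark can be written out explicitly. The following is only a sketch of the reasoning, not the argument used in the proof of Theorem 5; the generic threshold $\gamma_n = c\,n^{-\kappa}$ with a constant $c > 0$ is an assumption made for illustration.

```latex
% Each index counted on the left contributes at least \gamma_n^2 to the sum.
\#\{j : |\beta_j^M| \ge \gamma_n\}
  \;\le\; \gamma_n^{-2} \sum_{j=1}^{p_n} |\beta_j^M|^2
  \;=\; O\{\gamma_n^{-2}\,\lambda_{\max}(\Sigma)\}
  \;=\; O\{n^{2\kappa}\,\lambda_{\max}(\Sigma)\},
  \qquad \gamma_n = c\,n^{-\kappa}.

% The second equality in the display above uses only var(X^T \beta^*) = O(1):
\|\Sigma\boldsymbol{\beta}^*\|^2
  = (\Sigma^{1/2}\boldsymbol{\beta}^*)^T \,\Sigma\, (\Sigma^{1/2}\boldsymbol{\beta}^*)
  \;\le\; \lambda_{\max}(\Sigma)\, \boldsymbol{\beta}^{*T}\Sigma\boldsymbol{\beta}^*
  = O\{\lambda_{\max}(\Sigma)\}.
```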

When the covariates are jointly normally distributed, the condition of
Theorem 3 can be further simplified.

Proposition 1. Suppose that $X$ and $Z$ are jointly normal with mean
zero and standard deviation 1. For a strictly monotonic function $f(\cdot)$, $\mathrm{cov}(X, Z) = 0$
if and only if $\mathrm{cov}(X, f(Z)) = 0$, provided the latter covariance exists. In
addition,

$$ |\mathrm{cov}(X, f(Z))| \ge |\rho| \inf_{|x| \le c|\rho|} |g'(x)| \, E X^2 I(|X| \le c), $$

for any $c > 0$, where $\rho = EXZ$ and $g(x) = E f(x + \varepsilon)$ with $\varepsilon \sim N(0, 1 - \rho^2)$.

The above proposition shows that the covariance of $X$ and $f(Z)$ can be
bounded from below by the covariance between $X$ and $Z$, namely

$$ |\mathrm{cov}(X, f(Z))| \ge d|\rho|, \qquad d = \inf_{|x| \le c} |g'(x)| \, E X^2 I(|X| \le c), $$

in which $d > 0$ for a sufficiently small $c$. The first part of the proposition
actually holds when the conditional density $f(z|x)$ of $Z$ given $X$ is a
monotonic likelihood family (Bickel and Doksum, 2001) when $x$ is regarded as a
parameter. By taking $Z = \mathbf{X}^T\boldsymbol{\beta}^*$, a direct application of Theorem 2 is that
$\beta_j^M = 0$ if and only if

$$ \mathrm{cov}(\mathbf{X}^T\boldsymbol{\beta}^*, X_j) = 0, $$

provided that $\mathbf{X}$ is jointly normal, since $b'(\cdot)$ is an increasing function.
Furthermore, if

(4) $$ |\mathrm{cov}(\mathbf{X}^T\boldsymbol{\beta}^*, X_j)| \ge c_0 n^{-\kappa}, \qquad \kappa < 1/2, $$

for some positive constant $c_0$, a minimum condition required even for the
least-squares model (Fan and Lv, 2008), then by the second part of
Proposition 1, we have

$$ |\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)| \ge c_1 n^{-\kappa}, $$

for some constant $c_1$. Therefore, by Theorem 3, there exists a positive
constant $c_2$ such that

$$ |\beta_j^M| \ge c_2 n^{-\kappa}. $$

In other words, (4) suffices to have marginal signals that are above the
maximum noise level.
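
For intuition, the passage from (4) to the covariance bound can also be seen from Stein's identity for jointly normal variables. This is offered as an illustrative alternative to Proposition 1, not as the argument used in the paper, and it assumes in addition that $E|b''(\mathbf{X}^T\boldsymbol{\beta}^*)|$ is finite and that $E\,b''(\mathbf{X}^T\boldsymbol{\beta}^*)$ stays bounded away from zero.

```latex
% Stein's identity: for jointly normal (U, V) and differentiable g with E|g'(V)| finite,
%   cov(U, g(V)) = cov(U, V) E g'(V).
% Apply it with U = X_j, V = X^T \beta^*, g = b':
\mathrm{cov}\big(X_j,\, b'(\mathbf{X}^T\boldsymbol{\beta}^*)\big)
  = \mathrm{cov}\big(X_j,\, \mathbf{X}^T\boldsymbol{\beta}^*\big)\;
    E\, b''(\mathbf{X}^T\boldsymbol{\beta}^*).

% Since b'' > 0, condition (4) then gives
\big|\mathrm{cov}\big(b'(\mathbf{X}^T\boldsymbol{\beta}^*),\, X_j\big)\big|
  \;\ge\; c_0\, E\, b''(\mathbf{X}^T\boldsymbol{\beta}^*)\; n^{-\kappa},
% i.e. the role of c_1 is played by c_0 E b''(X^T \beta^*) under the stated assumptions.
```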

5.2. Uniform convergence and sure screening. To establish the SIS property
of MMLE, a key point is to establish the uniform convergence of the
MMLEs, that is, to control the maximum noise level relative to the signal.
Next we establish the uniform convergence rate for the MMLEs and the
sure screening property of the method in (3). The former will be useful in
controlling the size of the selected set.
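
To make the object of study concrete, the following is a minimal sketch of MMLE screening for logistic regression: each covariate gets its own two-parameter fit, and the variables whose marginal slope exceeds a threshold are kept. The function names `marginal_logistic_mle` and `mmle_screen`, the Newton solver, the simulated data and the threshold value 0.5 are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def marginal_logistic_mle(x, y, n_iter=25, ridge=1e-8):
    """Fit logit P(Y=1 | x) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])    # marginal design (1, X_j)
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        mu = 1.0 / (1.0 + np.exp(-eta))           # b'(theta) for the Bernoulli family
        w = mu * (1.0 - mu)                       # b''(theta)
        grad = X.T @ (y - mu)
        hess = (X * w[:, None]).T @ X + ridge * np.eye(2)
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def mmle_screen(X, y, gamma):
    """Return indices j with |marginal slope| >= gamma, plus all marginal slopes."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each covariate
    slopes = np.array([marginal_logistic_mle(Xs[:, j], y)[1]
                       for j in range(X.shape[1])])
    selected = np.flatnonzero(np.abs(slopes) >= gamma)
    return selected, slopes

# Illustrative data: 3 important variables among p = 200 independent normals.
rng = np.random.default_rng(0)
n, p = 1000, 200
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -2.0, 2.0]
prob = 1.0 / (1.0 + np.exp(-(X @ beta_star)))
y = (rng.random(n) < prob).astype(float)

selected, slopes = mmle_screen(X, y, gamma=0.5)   # gamma = 0.5 is an arbitrary choice
print(selected)
```

In this toy setup the covariates are independent, so the marginal slopes of the unimportant variables fluctuate around zero at the order of $n^{-1/2}$, which is the noise level that the uniform convergence result is meant to control.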

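Let $\boldsymbol{\beta}_j = (\beta_{j,0}, \beta_j)^T$ denote the two-dimensional parameter and
$\mathbf{X}_j = (1, X_j)^T$. Due to the concavity of the log-likelihood in GLIM with the
canonical link, $E\,l(\mathbf{X}_j^T\boldsymbol{\beta}_j, Y)$ has a unique minimum over $\boldsymbol{\beta}_j \in \mathcal{B}$ at an interior
point $\boldsymbol{\beta}_j^M = (\beta_{j,0}^M, \beta_j^M)^T$, where $\mathcal{B} = \{|\beta_{j,0}^M| \le B, |\beta_j^M| \le B\}$ is a square with
the width $B$ over which the marginal likelihood is maximized. The following
is an updated version of Conditions A, B and C for each marginal regression
and two additional conditions for the covariates and the population
parameters:

A'. The marginal Fisher information: $I_j(\boldsymbol{\beta}_j) = E\{b''(\mathbf{X}_j^T\boldsymbol{\beta}_j)\mathbf{X}_j\mathbf{X}_j^T\}$ is
finite and positive definite at $\boldsymbol{\beta}_j = \boldsymbol{\beta}_j^M$, for $j = 1, \ldots, p_n$. Moreover,
$\|I_j(\boldsymbol{\beta}_j)\|_{\mathcal{B}}$ is bounded from above.

B'. The second derivative of $b(\theta)$ is continuous and positive. There exists
an $\varepsilon_1 > 0$ such that for all $j = 1, \ldots, p_n$,

$$ \sup_{\boldsymbol{\beta} \in \mathcal{B}, \|\boldsymbol{\beta} - \boldsymbol{\beta}_j^M\| \le \varepsilon_1} |E\,b(\mathbf{X}_j^T\boldsymbol{\beta}) I(|X_j| > K_n)| \le o(n^{-1}). $$

C'. For all $\boldsymbol{\beta}_j \in \mathcal{B}$, we have
$E(l(\mathbf{X}_j^T\boldsymbol{\beta}_j, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)) \ge V\|\boldsymbol{\beta}_j - \boldsymbol{\beta}_j^M\|^2$,
for some positive $V$, bounded from below uniformly over $j = 1, \ldots, p_n$.

D. There exist some positive constants $m_0$, $m_1$, $s_0$, $s_1$ and $\alpha$, such that
for sufficiently large $t$,

$$ P(|X_j| > t) \le (m_1 - s_1)\exp\{-m_0 t^{\alpha}\}, \quad \text{for } j = 1, \ldots, p_n, $$

and that

$$ E\exp(b(\mathbf{X}^T\boldsymbol{\beta}^* + s_0) - b(\mathbf{X}^T\boldsymbol{\beta}^*)) + E\exp(b(\mathbf{X}^T\boldsymbol{\beta}^* - s_0) - b(\mathbf{X}^T\boldsymbol{\beta}^*)) \le s_1. $$

E. The conditions in Theorem 3 hold.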

Conditions A'-C' are satisfied in many examples of generalized linear
models, such as linear regression, logistic regression and Poisson regression.
Note that the second part of Condition D ensures the tail of the response
variable $Y$ to be exponentially light, as shown in the following lemma:

Lemma 1. If Condition D holds, for any $t > 0$,

$$ P(|Y| \ge m_0 t^{\alpha}/s_0) \le s_1 \exp(-m_0 t^{\alpha}). $$

Let $k_n = b'(K_n B + B) + m_0 K_n^{\alpha}/s_0$. Then Condition B holds for the exponential
family (1) with $K_n^* = m_0 K_n^{\alpha}/s_0$. The Lipschitz constant $k_n$ is bounded for
the logistic regression, since $Y$ and $b'(\cdot)$ are bounded. The following theorem
gives a uniform convergence result of MMLEs and a sure screening property.
Interestingly, the sure screening property does not directly depend on the
property of the covariance matrix of the covariates such as the growth of its
operator norm. This is an advantage over using the full likelihood.

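Theorem 4. Suppose that Conditions A', B', C' and D hold.

(i) If $n^{1-2\kappa}/(k_n^2 K_n^2) \to \infty$, then for any $c_3 > 0$, there exists a positive
constant $c_4$ such that

$$ P\Big(\max_{1 \le j \le p_n} |\hat{\beta}_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big)
   \le p_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

(ii) If, in addition, Condition E holds, then by taking $\gamma_n = c_5 n^{-\kappa}$ with
$c_5 \le c_2/2$, we have

$$ P(\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\gamma_n}) \ge 1 - s_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}, $$

where $s_n = |\mathcal{M}_\star|$, the size of non-sparse elements.

Remark 1. If we assume that $\min_{j \in \mathcal{M}_\star} |\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)| \ge c_1 n^{-\kappa+\delta}$
for any $\delta > 0$, then one can take $\gamma_n = c\,n^{-\kappa+\delta/2}$ for any $c > 0$ in Theorem 4.
This is essentially the thresholding used in Fan and Lv (2008).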

Note that when $b'(\cdot)$ is bounded, as in the Bernoulli model, $k_n$ is a finite
constant. In this case, by balancing the two terms in the upper bound of
Theorem 4(i), the optimal order of $K_n$ is given by

$$ K_n = n^{(1-2\kappa)/(\alpha+2)}, $$

and

$$ P\Big(\max_{1 \le j \le p_n} |\hat{\beta}_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) = O\{p_n \exp(-c_4 n^{(1-2\kappa)\alpha/(\alpha+2)})\}, $$

for a positive constant $c_4$. When the covariates $X_j$ are bounded, $k_n$ and
$K_n$ can be taken as finite constants. In this case,

$$ P\Big(\max_{1 \le j \le p_n} |\hat{\beta}_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) \le O\{p_n \exp(-c_4 n^{1-2\kappa})\}. $$

In both aforementioned cases, the tail probability in Theorem 4 is exponentially
small. In other words, we can handle the NP-dimensionality:

$$ \log p_n = o(n^{(1-2\kappa)\alpha/(\alpha+2)}), $$

with $\alpha = \infty$ for the case of bounded covariates.

For the ordinary linear model, $k_n = B(K_n + 1) + K_n^{\alpha}/(2s_0)$ and by taking
the optimal order of $K_n = n^{(1-2\kappa)/A}$ with $A = \max(\alpha + 4, 3\alpha + 2)$, we have

$$ P\Big(\max_{1 \le j \le p_n} |\hat{\beta}_j^M - \beta_j^M| \ge c_3 n^{-\kappa}\Big) = O\{p_n \exp(-c_4 n^{(1-2\kappa)\alpha/A})\}. $$

When the covariates are normal, $\alpha = 2$ and our result is weaker than that
given in Fan and Lv (2008), who permit $\log p_n = o(n^{1-2\kappa})$, whereas
Theorem 4 can only handle $\log p_n = o(n^{(1-2\kappa)/4})$. However, we allow non-normal
covariates and other error distributions.
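
The balancing steps quoted above can be spelled out as follows. This is a sketch of the order-of-magnitude bookkeeping only: constants and the polynomial factor $n m_1$ are ignored, and $K_n \ge 1$ is assumed.

```latex
% Bounded k_n (e.g. Bernoulli): equate the two exponents in Theorem 4(i).
\frac{n^{1-2\kappa}}{K_n^{2}} \asymp K_n^{\alpha}
\;\Longrightarrow\;
K_n \asymp n^{(1-2\kappa)/(\alpha+2)},
\qquad
K_n^{\alpha} \asymp n^{(1-2\kappa)\alpha/(\alpha+2)}.

% Ordinary linear model: k_n \asymp K_n + K_n^{\alpha} \asymp K_n^{\max(1,\alpha)}, so
(k_n K_n)^2 \asymp K_n^{\max(4,\; 2\alpha+2)},
\qquad
\frac{n^{1-2\kappa}}{K_n^{\max(4,\,2\alpha+2)}} \asymp K_n^{\alpha}
\;\Longrightarrow\;
K_n \asymp n^{(1-2\kappa)/A},
\quad A = \max(\alpha+4,\; 3\alpha+2).

% Normal covariates: \alpha = 2 gives A = 8 and exponent (1-2\kappa)\alpha/A = (1-2\kappa)/4.
```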

The above discussion applies to the sure screening property given in
Theorem 4(ii). It is only the size of non-sparse elements $s_n$ that matters for the
purpose of sure screening, not the dimensionality $p_n$.

5.3. Controlling false selection rates. After applying the variable screening
procedure, a natural question is how large the set $\widehat{\mathcal{M}}_{\gamma_n}$ is. In
other words, has the number of variables actually been reduced by the
independence learning? In this section, we aim to answer this question.

A simple answer to this question is the ideal case in which

$$ \mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = o(n^{-\kappa}), \quad \text{for } j \notin \mathcal{M}_\star. $$

In this case, under some mild conditions, we can show (see the proof of
Theorem 3) that

$$ \max_{j \notin \mathcal{M}_\star} |\beta_j^M| = o(n^{-\kappa}). $$

This, together with Theorem 4(i), shows that

$$ \max_{j \notin \mathcal{M}_\star} |\hat{\beta}_j^M| \le c_3 n^{-\kappa}, \quad \text{for any } c_3 > 0, $$

with probability tending to one if the probability in Theorem 4(i) tends to
zero. Hence, by the choice of $\gamma_n$ as in Theorem 4(ii), we can achieve model
selection consistency:

$$ P(\widehat{\mathcal{M}}_{\gamma_n} = \mathcal{M}_\star) = 1 - o(1). $$

This kind of condition was indeed implied by the condition in Huang et al.
(2008) in the special case of the ordinary linear model using bridge
regression, where a similar conclusion is drawn.

We now deal with the more general case. The idea is to bound the size of
the selected set (3) by using the fact that $\mathrm{var}(Y)$ is bounded. This usually implies
$\mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*) = \boldsymbol{\beta}^{*T}\Sigma\boldsymbol{\beta}^* = O(1)$. We need the following additional conditions:

F. The variance $\mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*)$ is bounded from above and below.

G. Either $b''(\cdot)$ is bounded or $\mathbf{X}_M = (X_1, \cdots, X_{p_n})^T$ follows an elliptically
contoured distribution, i.e.,

$$ \mathbf{X}_M = \Sigma_1^{1/2} R\,\mathbf{U}, $$

and $|E\,b'(\mathbf{X}^T\boldsymbol{\beta}^*)(\mathbf{X}^T\boldsymbol{\beta}^* - \beta_0^*)|$ is bounded, where $\mathbf{U}$ is uniformly
distributed on the unit sphere in $p$-dimensional Euclidean space,
independent of the nonnegative random variable $R$, and $\Sigma_1 = \mathrm{var}(\mathbf{X}_M)$.

Note that $\Sigma = \mathrm{diag}(0, \Sigma_1)$ in Condition G, since the covariance matrices
differ only in the intercept term. Hence, $\lambda_{\max}(\Sigma) = \lambda_{\max}(\Sigma_1)$. The following
result is about the size of $\widehat{\mathcal{M}}_{\gamma_n}$.
Theorem 5. Under Conditions A', B', C', D, F and G, we have for
any $\gamma_n = c_5 n^{-2\kappa}$, there exists a $c_4$ such that

$$ P[|\widehat{\mathcal{M}}_{\gamma_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}]
   \ge 1 - p_n\{\exp(-c_4 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

The right hand side probability has been explained in Section 5.2. From
the proof of Theorem 5, we actually show that the number of selected variables
is of order $\|\Sigma\boldsymbol{\beta}^*\|^2/\gamma_n^2$, which is further bounded by $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$
using $\mathrm{var}(\mathbf{X}^T\boldsymbol{\beta}^*) = O(1)$. Interestingly, while the sure screening property
does not depend on the behavior of $\Sigma$, the number of selected variables is
affected by how correlated the covariates are. When $n^{2\kappa}\lambda_{\max}(\Sigma)/p \to 0$, the
number of selected variables is indeed negligible compared to the original
size. In this case, the percentage of falsely discovered variables is of course
negligible. In particular, when $\lambda_{\max}(\Sigma) = O(n^{\tau})$, the size of the selected set
is of order $O(n^{2\kappa+\tau})$. This is of the same order as in Fan and Lv (2008) for
the multiple regression model with Gaussian data, who need the additional
condition that $2\kappa + \tau < 1$. Our result is an extension of Fan and Lv (2008)
even in this very specific case, without the condition $2\kappa + \tau < 1$. In addition,
our result is more intuitive: the number of selected variables is related to
$\lambda_{\max}(\Sigma)$, or more precisely $\|\Sigma\boldsymbol{\beta}^*\|^2$, and the thresholding parameter $\gamma_n$.

6. A likelihood ratio screening. In a similar variable screening problem
with generalized linear models, Fan et al. (2009) suggest screening the
variables by sorting the marginal likelihood. This method can be viewed as
a marginal likelihood ratio screening, as it builds on the increments of the
log-likelihood. In this section we show that the likelihood ratio screening is
equivalent to the MMLE screening in the sense that they both possess the
sure screening property and that the numbers of selected variables of the two
methods are of the same order of magnitude.

We first formulate the marginal likelihood screening procedure. Let

$$ L_{j,n} = P_n\{l(\hat{\beta}_0^M, Y) - l(\mathbf{X}_j^T\hat{\boldsymbol{\beta}}_j^M, Y)\}, \quad j = 1, \ldots, p_n, $$

and $\mathbf{L}_n = (L_{1,n}, \ldots, L_{p_n,n})^T$, where $\hat{\beta}_0^M = \arg\min_{\beta_0} P_n l(\beta_0, Y)$.
Correspondingly, let

$$ L_j^* = E\{l(\beta_0^M, Y) - l(\mathbf{X}_j^T\boldsymbol{\beta}_j^M, Y)\}, \quad j = 1, \ldots, p_n, $$

and $\mathbf{L}^* = (L_1^*, \ldots, L_{p_n}^*)^T$, where $\beta_0^M = \arg\min_{\beta_0} E\,l(\beta_0, Y)$. It can be shown
that $EY = b'(\beta_0^M)$ and that $\bar{Y} = b'(\hat{\beta}_0^M)$, where $\bar{Y}$ is the sample average.
We sort the vector $\mathbf{L}_n$ in descending order and select a set of variables

$$ \widehat{\mathcal{N}}_{\nu_n} = \{1 \le j \le p_n : L_{j,n} \ge \nu_n\}, $$

where $\nu_n$ is a predefined threshold value. Such an independence learning
ranks the importance of features according to their marginal contributions to
the magnitudes of the likelihood function. The marginal likelihood screening
and the MMLE screening share a common computation procedure, namely solving
$p_n$ optimization problems over a two-dimensional parameter space. Hence
the computation is much more feasible than traditional variable selection
methods.
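
As an illustration of the procedure just described, here is a minimal sketch that computes the increments $L_{j,n}$ for logistic regression, where $l(\theta, y) = b(\theta) - y\theta$ with $b(\theta) = \log(1 + e^{\theta})$. The helper names `neg_loglik`, `fit_marginal` and `marginal_lr_increments`, the Newton solver and the simulated data are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def neg_loglik(theta, y):
    """Per-observation loss l(theta, y) = b(theta) - y * theta, b(theta) = log(1 + e^theta)."""
    return np.logaddexp(0.0, theta) - y * theta

def fit_marginal(x, y, n_iter=25):
    """Newton-Raphson fit of the two-parameter marginal logistic model."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        hess = (X * w[:, None]).T @ X + 1e-8 * np.eye(2)
        beta = beta + np.linalg.solve(hess, X.T @ (y - mu))
    return beta

def marginal_lr_increments(X, y):
    """L_{j,n}: average loss of the intercept-only fit minus that of the j-th marginal fit."""
    ybar = y.mean()
    theta0 = np.log(ybar / (1.0 - ybar))          # solves ybar = b'(theta0)
    null_loss = neg_loglik(theta0, y).mean()
    L = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        b0, b1 = fit_marginal(X[:, j], y)
        L[j] = null_loss - neg_loglik(b0 + b1 * X[:, j], y).mean()
    return L

# Illustrative data: 3 important variables among p = 100 independent normals.
rng = np.random.default_rng(1)
n, p = 1000, 100
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -2.0, 2.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ beta_star)))).astype(float)

L = marginal_lr_increments(X, y)
ranking = np.argsort(-L)                          # descending order of L_{j,n}
print(ranking[:5])
```

Each increment is nonnegative up to numerical error, because the marginal model contains the intercept-only model as a special case; thresholding `L` at a value $\nu_n$ then gives the selected set.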

Compared with MMLE screening, where the information utilized is only
the magnitudes of the estimators, the marginal likelihood screening
incorporates the whole contributions of the features to the likelihood increments:
both the magnitudes of the estimators and their associated variation. Under
the current condition (Condition C'), the variances of the MMLEs are at a
comparable level (through the magnitude $V$, an implication of the convexity
of the objective functions), and the two screening methods are equivalent.
Otherwise, if $V$ depends on $n$, the marginal likelihood screening can still
preserve the non-sparsity structure, while the MMLE screening may need
some corresponding adjustments, which we will not discuss in detail as it is
beyond the scope of the current paper.

Next we will show that the sure screening property holds under certain
conditions. Similarly to the MMLE screening, we first build the theoretical
foundation of the marginal likelihood screening. That is, the marginal
likelihood increment is also a measurement of the correlation between the
marginal covariate and the mean response function.

Theorem 6. For $j = 1, \ldots, p_n$, the marginal likelihood increment
$L_j^* = 0$ if and only if $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j) = 0$.

As a direct corollary of Theorem 6, we can easily show the following
corollary for the purpose of model selection consistency.

Corollary 2. If the partial orthogonality condition in Corollary 1
holds, then $L_j^* = 0$ for $j \notin \mathcal{M}_\star$.

We can also strengthen the result of minimum signals as follows. On the
other hand, we also show that the total signals cannot be too large. That
is, there cannot be too many signals that exceed a certain threshold.

Theorem 7. Under the conditions in Theorem 3 and Condition
C', we have

$$ \min_{j \in \mathcal{M}_\star} |L_j^*| \ge c_6 n^{-2\kappa}, $$

for some positive constant $c_6$, provided that $|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol{\beta}^*), X_j)| \ge c_1 n^{-\kappa}$ for
$j \in \mathcal{M}_\star$. If, in addition, Conditions F and G hold, then

$$ \|\mathbf{L}^*\| = O(\|\boldsymbol{\beta}^M\|^2) = O(\|\Sigma\boldsymbol{\beta}^*\|^2) = O(\lambda_{\max}(\Sigma)). $$

The technical challenge is that the stochastic noise $\|\mathbf{L}_n - \mathbf{L}^*\|_\infty$ is usually
of the order of $O(n^{-2\kappa} + n^{-1/2}\log p_n)$, which can be an order of magnitude
larger than the signals given in Theorem 7, unless $\kappa < 1/4$. Nevertheless,
by a different trick that utilizes the fact that ranking is invariant under a
strict monotonic transform, we are able to demonstrate the sure independence
screening property for $\kappa < 1/2$.

Theorem 8. Suppose that Conditions A', B', C', D, E and F
hold. Then, by taking $\nu_n = c_7 n^{-2\kappa}$ for a sufficiently small $c_7 > 0$, there
exists a $c_8 > 0$ such that

$$ P(\mathcal{M}_\star \subset \widehat{\mathcal{N}}_{\nu_n}) \ge 1 - s_n\{\exp(-c_8 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

Similarly to the MMLE screening, we can control the size of $\widehat{\mathcal{N}}_{\nu_n}$ as follows.
For simplicity of the technical argument, we focus only on the case where
$b''(\cdot)$ is bounded.

Theorem 9. Under Conditions A', B', C', D, F and G, if $b''(\cdot)$ is
bounded, then we have

$$ P[|\widehat{\mathcal{N}}_{\nu_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}]
   \ge 1 - p_n\{\exp(-c_8 n^{1-2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^{\alpha})\}. $$

7. Numerical Results. In this section, we present several simulation
examples to evaluate the performance of the SIS procedure with generalized
linear models. It was demonstrated in Fan and Lv (2008) and Fan et al. (2009)
that independent screening is a fast but crude method of reducing the
dimensionality to a more moderate size. Some methodological extensions, including
iterative SIS (ISIS) and multi-stage procedures such as SIS-SCAD and
SIS-LASSO, can be applied to perform the final variable selection and parameter
estimation simultaneously. Extensive simulations on these procedures were
also presented in Fan et al. (2009). To avoid repetition, in this paper, we
focus on the vanilla SIS, with the aim to evaluate the sure screening property
and to demonstrate some factors influencing the false selection rate.
We vary the sample size from 80 to 600 for different scenarios to gauge the
difficulties of the simulation models. The following three configurations with
$p = 2000$, $5000$ and $40000$ predictor variables are considered for generating
the covariates $\mathbf{X} = (X_1, \cdots, X_p)^T$:

S1. The covariates are generated according to

(5) $$ X_j = \frac{\varepsilon_j + a_j \varepsilon}{\sqrt{1 + a_j^2}}, $$

where $\varepsilon$ and $\{\varepsilon_j\}_{j=1}^{[p/3]}$ are i.i.d. standard normal random variables,
$\{\varepsilon_j\}_{j=[p/3]+1}^{[2p/3]}$ are i.i.d. and follow a double exponential distribution
with location parameter zero and scale parameter one, and $\{\varepsilon_j\}_{j=[2p/3]+1}^{p}$
are i.i.d. and follow a mixture normal distribution with two components
$N(-1, 1)$, $N(1, 0.5)$ and equal mixture proportion. The covariates
are standardized to be mean zero and variance one. The constants
$\{a_j\}_{j=1}^{q}$ are the same and chosen such that the correlation
$\rho = \mathrm{corr}(X_i, X_j) = 0, 0.2, 0.4, 0.6$ and $0.8$, among the first $q$ variables,
and $a_j = 0$ for $j > q$. The parameter $q$ is also related to the overall
correlation in the covariance matrix. We will present the numerical
results with $q = 15$ for this setting.
Table 1

The median of the 200 empirical maximum eigenvalues, with its robust estimate of SD in
parentheses, of the corresponding sample covariance matrices of covariates based on 200
simulations with partial combinations of n = 80, 300, 600, p = 2000, 5000, 40000 and
q = 15, 50 in the first two settings (S1 and S2).

(p, n)         Setting       rho = 0       rho = 0.2     rho = 0.4     rho = 0.6     rho = 0.8
(40000, 80)    S1 (q = 15)   549.9(1.4)    550.1(1.4)    550.1(1.3)    550.1(1.3)    550.1(1.4)
(40000, 80)    S2 (q = 50)   550.0(1.4)    550.1(1.4)    550.4(1.5)    552.9(1.8)    558.5(2.4)
(40000, 300)   S1 (q = 15)   157.3(0.4)    157.4(0.4)    157.4(0.4)    157.4(0.3)    157.7(0.4)
(40000, 300)   S2 (q = 50)   157.4(0.4)    157.5(0.4)    160.9(1.2)    168.2(1.0)    176.9(1.0)
(5000, 300)    S1 (q = 15)   25.68(0.2)    25.68(0.2)    26.18(0.2)    27.99(0.4)    30.28(0.4)
(5000, 300)    S1 (q = 50)   25.69(0.1)    29.06(0.5)    37.98(0.7)    47.49(0.7)    57.17(0.5)
(2000, 600)    S1 (q = 15)   7.92(0.07)    8.32(0.15)    10.5(0.3)     13.09(0.3)    15.79(0.2)
(2000, 600)    S1 (q = 50)   7.93(0.07)    14.62(0.40)   23.95(0.7)    33.90(0.6)    43.56(0.5)
(2000, 600)    S2 (q = 50)   7.93(0.07)    14.62(0.40)   23.95(0.7)    33.90(0.6)    43.56(0.5)


S2. The covariates are also generated from (5), except that $\{a_j\}_{j=1}^{q}$ are
i.i.d. normal random variables with mean $a$ and variance 1 and $a_j = 0$
for $j > q$. The value of $a$ is taken such that $E\,\mathrm{corr}(X_i, X_j) =
0, 0.2, 0.4, 0.6$ and $0.8$, among the first $q$ variables. The simulation
results to be presented for this setting use $q = 50$.

S3. Let $\{X_j\}_{j=1}^{p-50}$ be i.i.d. standard normal random variables and

$$ X_k = \sum_{j=1}^{s} X_j (-1)^{j+1}/5 + \sqrt{25 - s}/5\;\varepsilon_k, \quad k = p - 49, \cdots, p, $$

where $\{\varepsilon_k\}_{k=p-49}^{p}$ are standard normally distributed.
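
The construction in Setting S1 can be sketched in code. This is an illustration under explicit assumptions, not the authors' simulation script: the function name `generate_s1` is hypothetical, the noise terms are standardized analytically before mixing, the value 0.5 in $N(1, 0.5)$ is read as a variance, and the common loading is taken as $a = \sqrt{\rho/(1-\rho)}$, which gives $\mathrm{corr}(X_i, X_j) = a^2/(1 + a^2) = \rho$ once the noise has unit variance.

```python
import numpy as np

def generate_s1(n, p, q, rho, rng):
    """Draw an n x p covariate matrix following equation (5) of Setting S1."""
    third, two_thirds = p // 3, (2 * p) // 3
    eps = np.empty((n, p))
    # First block: standard normal noise.
    eps[:, :third] = rng.standard_normal((n, third))
    # Second block: double exponential (Laplace) with scale 1, variance 2.
    eps[:, third:two_thirds] = rng.laplace(0.0, 1.0, (n, two_thirds - third)) / np.sqrt(2.0)
    # Third block: equal mixture of N(-1, 1) and N(1, 0.5); mean 0, variance 1.75
    # (assumption: 0.5 is the variance of the second component).
    m = p - two_thirds
    first_component = rng.random((n, m)) < 0.5
    mix = np.where(first_component,
                   rng.normal(-1.0, 1.0, (n, m)),
                   rng.normal(1.0, np.sqrt(0.5), (n, m)))
    eps[:, two_thirds:] = mix / np.sqrt(1.75)
    # Common factor loads only on the first q variables.
    a = np.zeros(p)
    a[:q] = np.sqrt(rho / (1.0 - rho))
    common = rng.standard_normal((n, 1))
    return (eps + common * a) / np.sqrt(1.0 + a ** 2)

rng = np.random.default_rng(2)
X = generate_s1(n=4000, p=30, q=15, rho=0.6, rng=rng)
C = np.corrcoef(X, rowvar=False)
print(round(C[0, 14], 2), round(C[0, 25], 2))
```

With these choices every column has population mean zero and variance one, so no further empirical standardization is needed in the sketch.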

Table 1 summarizes the median of the empirical maximum eigenvalues
of the covariance matrix and its robust estimate of the standard deviation
(RSD) based on 200 simulations in the first two settings (S1 and S2) with
partial combinations of sample size n = 80, 300, 600, p = 2000, 5000, 40000 and
q = 15, 50. RSD is the interquartile range (IQR) divided by 1.34. The empirical
maximum eigenvalues are always larger than their population version,
depending on the realizations of the design matrix. The empirical minimum
eigenvalue is always zero, and the empirical condition numbers for the sample
covariance matrix are infinite, since p > n. Generally, the empirical maximum
eigenvalues increase as the correlation parameters $\rho$, q, the numbers
of covariates p increase, and/or the sample sizes n decrease.
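
The two summaries reported in Table 1 can be computed as in the sketch below. The function names are hypothetical, and dividing by $n$ (rather than $n - 1$) in the sample covariance is an assumption. When $p > n$, the sketch uses the $n \times n$ Gram matrix, which has the same nonzero eigenvalues as the $p \times p$ sample covariance matrix.

```python
import numpy as np

def max_eig_sample_cov(X):
    """Largest eigenvalue of the sample covariance matrix (divisor n)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    if p > n:
        G = Xc @ Xc.T / n      # n x n; shares nonzero eigenvalues with Xc.T @ Xc / n
    else:
        G = Xc.T @ Xc / n
    return np.linalg.eigvalsh(G)[-1]   # eigvalsh returns eigenvalues in ascending order

def median_and_rsd(values):
    """Median and robust SD estimate, RSD = IQR / 1.34."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    return med, (q3 - q1) / 1.34

rng = np.random.default_rng(3)
X_demo = rng.standard_normal((20, 50))
lam = max_eig_sample_cov(X_demo)
lam_ref = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False, bias=True))[-1]

# Median and RSD over a handful of simulated replications.
lams = [max_eig_sample_cov(rng.standard_normal((20, 50))) for _ in range(30)]
med, rsd = median_and_rsd(lams)
print(med, rsd)
```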

With these three settings, we aim to illustrate the behaviors of the two
SIS procedures under different correlation structures. For each simulation
and each model, we apply the two SIS procedures, the marginal MLE and
the marginal likelihood ratio methods, to screen variables. The minimum
model size (MMS) required for each method to have a sure screening, i.e.
to contain the true model $\mathcal{M}_\star$, is used as a measure of the effectiveness
of a screening method. This avoids the issues of choosing the thresholding
parameter. To gauge the difficulty of the problem, we also include the LASSO
and the SCAD as references for comparison when p = 2000 and 5000. The
smaller p is used due to the computation burden of the LASSO and the
SCAD. In addition, as demonstrated in our simulation results, they do not
perform well when p is large. Our initial intention is to demonstrate that
the simple SIS does not perform much worse than the far more complicated
procedures like the LASSO and the SCAD. To our surprise, the SIS can even
outperform those more complicated methods in terms of variable screening.
Again, we record the MMS for the LASSO and the SCAD for each simulation
and each model, which does not depend on the choice of regularization
parameters. When the LASSO or the SCAD cannot recover the true model
even with the smallest regularization parameter, we average the model size
with the smallest regularization parameter and p. These interpolated MMS
values are presented in italic font in Tables 3-5 and 9 to distinguish them from
the real MMS. Results for logistic regressions and linear regressions are presented in
the following two subsections.
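
For a ranking-based screener, the MMS is simply the worst rank attained by any truly important variable. A minimal sketch follows; the function name `minimum_model_size` and the toy scores are illustrative assumptions, and the scores may be either the marginal slopes or the likelihood increments.

```python
import numpy as np

def minimum_model_size(scores, true_idx):
    """Smallest number of top-ranked variables that contains every index in true_idx."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-np.abs(scores), kind="stable")   # most important first
    rank = np.empty(len(scores), dtype=int)
    rank[order] = np.arange(1, len(scores) + 1)           # rank 1 = largest |score|
    return int(rank[list(true_idx)].max())

toy_scores = [0.1, 3.0, -2.0, 0.5, 1.0]
mms_a = minimum_model_size(toy_scores, true_idx=[1, 2])
mms_b = minimum_model_size(toy_scores, true_idx=[1, 3])
print(mms_a, mms_b)
```

Because the MMS depends only on the ranking, it does not require choosing a threshold, which is the reason it is used as the performance measure here.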
7.1. Logistic Regressions. The generated data $(\mathbf{X}_1^T, Y_1), \ldots, (\mathbf{X}_n^T, Y_n)$ are
n i.i.d. copies of a pair $(\mathbf{X}^T, Y)$, in which the conditional distribution of the
response $Y$ given $\mathbf{X} = \mathbf{x}$ is a binomial distribution with probability of success
$p(\mathbf{x}) = \exp(\mathbf{x}^T\boldsymbol{\beta}^*)/[1 + \exp(\mathbf{x}^T\boldsymbol{\beta}^*)]$. We vary the size of the nonsparse set
of coefficients as s = 3, 6, 12, 15 and 24. For each simulation, we evaluate
each method by summarizing the median minimum model size (MMMS)
of the selected models as well as its associated RSD, which is the associated
interquartile range (IQR) divided by 1.34. The results, based on 200
simulations for each scenario, are recorded in the second and third panel of
Table 2 and the second panel of Tables 3-5. Specifically, Table 2 records the
MMMS and the associated RSD for SIS under the first two settings when
p = 40000, while Tables 3-5 record these results for SIS, the LASSO and the
SCAD when p = 2000 and 5000 under Settings 1, 2 and 3 respectively. The
true parameters are also recorded in each corresponding table.

To demonstrate the difficulty of our simulated models, we depict the
distribution, among 200 simulations, of the minimum |t|-statistics of s estimated
regression coefficients in the oracle model in which the statistician
does not know that all variables are statistically significant. This shows the
difficulty to recover all significant variables even in the oracle model with
the minimum model size s. The distribution was computed for each setting
and scenario but only a few selected settings are presented in Figure
1. In fact, the distributions under Setting 1 are very similar to those under
Setting 2 when the same q value is taken. It can be seen that the magnitude
of the minimum |t|-statistics is reasonably small and getting smaller as the
correlation within covariates (measured by $\rho$ and q) increases, sometimes
achieving three decimals. Given such a small signal-to-noise ratio in the oracle
models, the difficulty of our simulation models is self-evident even if
the signals seem not that small.

Table 2

The MMMS and the associated RSD (in parentheses) of the simulated examples for
logistic regressions in the first two settings (S1 and S2) when p = 40000.

Setting 1, q = 15

        s = 3, beta* = (1, 1.3, 1)^T                  s = 6, beta* = (1, 1.3, 1, ...)^T
rho     n             SIS-MLR       SIS-MMLE          n      SIS-MLR      SIS-MMLE
0       [illegible]   [illegible]   89(375)           300    47(164)      50(170)
0.2     200           3(0)          3(0)              300    6(0)         6(0)
0.4     200           3(0)          3(0)              300    7(1)         7(1)
0.6     200           3(1)          3(1)              300    8(1)         8(2)
0.8     200           4(1)          4(1)              300    9(3)         9(3)

        s = 12, beta* = (1, 1.3, ...)^T               s = 15, beta* = (1, 1.3, ...)^T
rho     n             SIS-MLR       SIS-MMLE          n      SIS-MLR      SIS-MMLE
0       500           297(589)      302.5(597)        600    350(607)     359.5(612)
0.2     300           13(1)         13(1)             300    15(0)        15(0)
0.4     300           14(1)         14(1)             300    15(0)        15(0)
0.6     300           14(1)         14(1)             300    15(0)        15(0)
0.8     300           14(1)         14(1)             300    15(0)        15(0)

Setting 2, q = 50

        s = 3, beta* = (1, 1.3, 1)^T                  s = 6, beta* = (1, 1.3, 1, ...)^T
rho     n             SIS-MLR       SIS-MMLE          n      SIS-MLR      SIS-MMLE
0       300           84.5(376)     88.5(383)         500    6(1)         6(1)
0.2     300           3(0)          3(0)              500    6(0)         6(0)
0.4     300           3(0)          3(0)              500    6(1)         6(1)
0.6     300           3(1)          3(1)              500    8.5(4)       9(5)
0.8     300           5(4)          5(4)              500    13.5(8)      14(8)

        s = 12, beta* = (1, 1.3, ...)^T               s = 15, beta* = (1, 1.3, ...)^T
rho     n             SIS-MLR       SIS-MMLE          n      SIS-MLR      SIS-MMLE
0       600           77(114)       78.5(118)         800    46(82)       47(83)
0.2     500           18(7)         18(7)             500    26(6)        26(6)
0.4     500           25(8)         25(10)            500    34(7)        33(8)
0.6     500           32(9)         31(8)             500    39(7)        38(7)
0.8     500           36(8)         35(9)             500    40(6)        42(7)

Table 3

The MMMS and the associated RSD (in parentheses) of the simulated examples for
logistic regressions in Setting 1 (S1) when p = 5000 and q = 15. The values with italic font
indicate that the LASSO or the SCAD cannot recover the true model even with the smallest
regularization parameter and are estimated.

rho     n      SIS-MLR       SIS-MMLE      LASSO         SCAD

s = 3, beta* = (1, 1.3, 1)^T
0       300    3(0)          3(0)          3(1)          3(1)
0.2     300    3(0)          3(0)          3(0)          3(0)
0.4     300    3(0)          3(0)          3(0)          3(0)
0.6     300    3(0)          3(0)          3(0)          3(1)
0.8     300    3(1)          3(1)          4(1)          4(1)

s = 6, beta* = (1, 1.3, 1, 1.3, 1, 1.3)^T
0       200    8(6)          9(7)          7(1)          7(1)
0.2     200    18(38)        20(39)        9(4)          9(2)
0.4     200    51(77)        64.5(76)      20(10)        16.5(6)
0.6     300    77.5(139)     77.5(132)     20(13)        19(9)
0.8     400    306.5(347)    313(336)      86(40)        70.5(35)

s = 12, beta* = (1, 1.3, ...)^T
0       300    297.5(359)    300(361)      72.5(3704)    12(0)
0.2     300    13(1)         13(1)         12(1)         12(0)
0.4     300    14(1)         14(1)         14(1861)      13(1865)
0.6     300    14(1)         14(1)         2552(85)      12(3721)
0.8     300    14(1)         14(1)         2556(10)      12(3722)

s = 15, beta* = (3, 4, ...)^T
0       300    479(622)      482(615)      69.5(68)      15(0)
0.2     300    15(0)         15(0)         16(13)        15(0)
0.4     300    15(0)         15(0)         38(3719)      15(3720)
0.6     300    15(0)         15(0)         2555(87)      15(1472)
0.8     300    15(0)         15(0)         2552(8)       15(1322)


The MMMS and RSD with fixed correlation (SI) and random correla- 
tion (S2) are comparable under the same q. As the correlation increases 
and/or the nonsparse set size increases, the MMMS and the associated RSD 
usually increase for all SIS, the LASSO and the SCAD. Among all the de- 
signed scenarios of Settings 1 and 2, SIS performs well, while the LASSO 
and the SCAD occasionally fail under very high correlations and relatively 
large nonsparse set size (s=12, 15 and 24). Interestingly, correlation within 
covariates can sometimes help SIS reduce the false selection rate, as it can 
Table 4

The MMMS and the associated RSD (in parentheses) of the simulated examples for
logistic regressions in Setting 2 (S2) when p = 2000 and q = 50. The values with italic font
have the same meaning as Table 2.

  $\rho$   n             SIS-MLR       SIS-MMLE      LASSO         SCAD

  s = 3, $\beta^* = (3, 4, 3)^T$
  0        [illegible]   3(0)          3(0)          3(0)          [illegible]
  0.2      [illegible]   3(0)          3(0)          3(0)          [illegible]
  0.4      200           3(0)          3(0)          3(0)          [illegible]
  0.6      [illegible]   3(1)          3(1)          3(1)          [illegible]
  0.8      200           5(5)          5.5(5)        6(4)          [illegible]

  s = 6, $\beta^* = (3, -3, 3, -3, 3, -3)^T$
  0        200           8(6)          9(7)          7(1)          7([illegible])
  0.2      200           18(38)        20(39)        9(4)          [illegible]
  0.4      200           51(77)        64.5(76)      20(10)        16.5(6)
  0.6      300           77.5(139)     77.5(132)     20(13)        19(9)
  0.8      400           306.5(347)    313(336)      86(40)        70.5(35)

  s = 12, $\beta^* = (3, 4, \ldots)^T$
  0        600           13(6)         13(7)         12(0)         12(0)
  0.2      600           19(6)         19(6)         13(1)         13(2)
  0.4      600           32(10)        30(10)        18(3)         17(4)
  0.6      600           38(9)         38(10)        22(3)         22(4)
  0.8      600           38(7)         39(8)         1071(6)       1042(34)

  s = 24, $\beta^* = (3, 4, \ldots)^T$
  0        600           180(240)      182(238)      35(9)         31(10)
  0.2      600           45(4)         45(4)         35(27)        32(24)
  0.4      600           46(3)         47(2)         1099(17)      1093(1456)
  0.6      600           48(2)         48(2)         1078(5)       1065(23)
  0.8      600           48(1)         48(1)         1072(4)       1067(13)


increase the marginal signals. It is notable that the LASSO and the SCAD
usually cannot select the important variables in the third setting, due to
the violation of the irrepresentable condition for s = 6, 12 and 24, while SIS
performs reasonably well.

7.2. Linear Models. The generated data $(\mathbf{X}_1^T, Y_1), \ldots, (\mathbf{X}_n^T, Y_n)$ are n
i.i.d. copies of a pair $(\mathbf{X}^T, Y)$, in which the response Y follows a linear
model with $Y = \mathbf{X}^T\boldsymbol\beta^* + \varepsilon$, where the random error $\varepsilon$ is standard normally
distributed. The covariates are generated in the same manner as in the
logistic regression settings. We take the same true coefficients and correlation
Table 5

The MMMS and the associated RSD (in parentheses) of the simulated examples for
logistic regressions in Setting 3 (S3) when p = 2000 and n = 600. The values with italic
font have the same meaning as Table 2. M-$\lambda_{\max}$ and its RSD have the same meaning as
Table 1.

  s     M-$\lambda_{\max}$(RSD)   SIS-MLR       SIS-MMLE      LASSO         SCAD
  3     8.47(0.17)         3(0)          3(0)          3(1)          3(0)
  6     10.36(0.26)        56(0)         56(0)         1227(7)       1142(64)
  12    14.69(0.39)        63(6)         63(6)         1148(8)       1093(59)
  24    23.70(0.14)        214.5(93)     208.5(82)     1120(5)       1087(24)


structures for part of the scenarios (p = 40000) as the logistic regression
examples, while varying the true coefficients for other scenarios, to gauge the
difficulty of the problem. The sample size for each scenario is correspondingly
decreased to reflect the fact that the linear model is more informative.
The results are recorded in Tables 6-9, respectively. The trends of the MMMS
and the associated RSD of SIS, the LASSO and the SCAD with the
correlation and/or the nonsparse set size are similar to those in the logistic
regression examples, but their magnitudes are usually smaller in the linear
regression examples, as the model is more informative. Overall, SIS
does a very reasonable job in screening irrelevant variables and sometimes
outperforms the LASSO and the SCAD.
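
The simulation design of this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the tables: the function names `minimum_model_size` and `simulate_mmms` are hypothetical, the covariates are drawn independently (the $\rho = 0$ case), the minimum model size is taken to be the smallest number of top-ranked covariates that contains all truly active ones, and the RSD is assumed to be the interquartile range divided by 1.34.

```python
import numpy as np

def minimum_model_size(X, y, true_idx):
    """Smallest number of top-ranked covariates containing all of true_idx (assumed definition)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each covariate
    yc = y - y.mean()
    # With standardized covariates, the marginal least-squares slope of y on X_j
    # equals the sample covariance between X_j and y.
    beta_marg = Xs.T @ yc / len(y)
    order = np.argsort(-np.abs(beta_marg))      # most important covariate first
    rank = np.empty_like(order)
    rank[order] = np.arange(1, len(order) + 1)  # rank[j] = position of covariate j
    return int(rank[true_idx].max())

def simulate_mmms(n=100, p=500, beta=(0.5, 0.67, 0.5), reps=50, seed=0):
    """Median minimum model size and a robust SD (IQR / 1.34, an assumption) over repetitions."""
    rng = np.random.default_rng(seed)
    s = len(beta)
    sizes = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))         # independent covariates (rho = 0)
        y = X[:, :s] @ np.asarray(beta) + rng.standard_normal(n)
        sizes.append(minimum_model_size(X, y, np.arange(s)))
    sizes = np.asarray(sizes)
    q75, q25 = np.percentile(sizes, [75, 25])
    return float(np.median(sizes)), float((q75 - q25) / 1.34)
```

Calling `simulate_mmms(n=100, p=5000, reps=200)` would mimic the scale of one cell of the linear-regression tables, although the numbers will not match the paper exactly because the random seeds and the correlation structure differ.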

8. Concluding Remarks. In this paper, we propose two independent
screening methods by ranking the maximum marginal likelihood estimators
and the maximum marginal likelihood in generalized linear models.
With Fan and Lv (2008) as a special case, the proposed method is shown
to possess the sure independence screening property. The success of
marginal screening suggests that any surrogate screening, besides
the marginal utility screening introduced in this paper, can be a good
option for population variable screening, as long as it preserves the
nonsparsity structure of the true model and is computationally feasible. It also
Table 6

The MMMS and the associated RSD (in parentheses) of the simulated examples in
the first two settings (S1 and S2) for linear regressions when p = 40000.

  $\rho$   n     SIS-MLR       SIS-MMLE      |  n     SIS-MLR       SIS-MMLE

  Setting 1, q = 15
           s = 3, $\beta^* = (1, 1.3, 1)^T$        |  s = 6, $\beta^* = (1, 1.3, 1, \ldots)^T$
  0        80    [illegible]   [illegible]   |  150   42(157)       42(157)
  0.2      80    3(0)          3(0)          |  150   6(0)          6(0)
  0.4      80    3(0)          3(0)          |  150   6.5(1)        6.5(1)
  0.6      80    3(0)          3(0)          |  150   6(1)          6(1)
  0.8      80    3(0)          3(0)          |  150   7(1)          7(1)

           s = 12, $\beta^* = (1, 1.3, \ldots)^T$   |  s = 15, $\beta^* = (1, 1.3, \ldots)^T$
  0        300   143(282)      143(282)      |  400   135.5(167)    135.5(167)
  0.2      200   13(1)         13(1)         |  200   15(0)         15(0)
  0.4      200   13(1)         13(1)         |  200   15(0)         15(0)
  0.6      200   13(1)         13(1)         |  200   15(0)         15(0)
  0.8      200   13(1)         13(1)         |  200   15(0)         15(0)

  Setting 2, q = 50
           s = 3, $\beta^* = (1, 1.3, 1)^T$        |  s = 6, $\beta^* = (1, 1.3, 1, \ldots)^T$
  0        100   3(2)          3(2)          |  200   7.5(7)        7.5(7)
  0.2      100   3(0)          3(0)          |  200   6(1)          6(1)
  0.4      100   3(0)          3(0)          |  200   7(1)          7(1)
  0.6      100   3(0)          3(0)          |  200   7(2)          7(2)
  0.8      100   3(1)          3(1)          |  200   8(4)          8(4)

           s = 12, $\beta^* = (1, 1.3, \ldots)^T$   |  s = 15, $\beta^* = (1, 1.3, \ldots)^T$
  0        400   22(27)        22(27)        |  500   35(52)        35(52)
  0.2      300   16(5)         16(5)         |  300   24(7)         24(7)
  0.4      300   19(8)         19(8)         |  300   30(10)        30(10)
  0.6      300   25(8)         25(8)         |  300   33.5(7)       33.5(7)
  0.8      300   24(7)         24(7)         |  300   35(8)         35(8)

Table 7

The MMMS and the RSD (in parentheses) of the simulated examples for linear
regressions in Setting 1 (S1) when p = 5000 and q = 15.

  $\rho$   n     SIS-MLR       SIS-MMLE      LASSO         SCAD

  s = 3, $\beta^* = (0.5, 0.67, 0.5)^T$
  0        100   12(40)        12(40)        3(1)          3(1)
  0.2      100   3(1)          3(1)          3(0)          3(0)
  0.4      100   3(0)          3(0)          3(0)          3(0)
  0.6      100   3(1)          3(1)          5(7)          5(5)
  0.8      100   4(2)          4(2)          4(1)          4(1)

  s = 6, $\beta^* = (0.5, 0.67, 0.5, 0.67, 0.5, 0.67)^T$
  0        100   210.5(422)    210.5(422)    33.5(651)     25(22)
  0.2      100   7(2)          7(2)          6(1)          6(1)
  0.4      100   7(2)          7(2)          6(1)          6(1)
  0.6      100   8(2)          8(2)          7(1)          7(1)
  0.8      100   9(3)          9(3)          7(2)          8(1)

  s = 12, $\beta^* = (0.5, 0.67, \ldots)^T$
  0        300   49(76)        49(76)        12(1)         12(0)
  0.2      100   14(2)         14(2)         12(1)         12(1)
  0.4      100   14(1)         14(1)         12(1)         12(1)
  0.6      100   14(1)         14(1)         13(1)         13(1)
  0.8      100   14(1)         14(1)         13(1)         13(1)

  s = 15, $\beta^* = (0.5, 0.67, \ldots)^T$
  0        300   199(251)      199(251)      17(2)         15(0)
  0.2      100   17(5)         17(5)         15(1)         15(0)
  0.4      100   15(0)         15(0)         15(0)         15(0)
  0.6      100   15(0)         15(0)         15(0)         15(0)
  0.8      100   15(0)         15(0)         15(0)         15(1)

Table 8

The MMMS and the RSD (in parentheses) of the simulated examples for linear
regressions in Setting 2 (S2) when p = 2000 and q = 50.

  $\rho$   n     SIS-MLR       SIS-MMLE      LASSO         SCAD

  s = 3, $\beta^* = (0.6, 0.8, 0.6)^T$
  0        100   5(14)         6(16)         4(4)          4(2)
  0.2      100   3(1)          3(1)          3(0)          3(0)
  0.4      100   3(1)          4(1)          3(1)          3(1)
  0.6      100   5(3)          7(5)          4(1)          4(1)
  0.8      100   7(7)          14(12)        5(57)         7(4)

  s = 6, $\beta^* = (3, -3, 3, -3, 3, -3)^T$
  0        100   15(43)        18(47)        6(0)          [illegible]
  0.2      100   42(116)       47(99)        7(1)          7(1)
  0.4      100   143(207)      129(226)      12(4)         12(5)
  0.6      200   47(93)        49(110)       7(1)          7(1)
  0.8      200   360(470)      376.5(486)    54(32)        51(25)

  s = 12, $\beta^*$ = (0.6, 0. ...
  0        200   151(212)      140(207)      15(4)         15(4)
  0.2      100   37.5(10)      36(12)        16(3)         16(4)
  0.4      100   39(7)         40.5(8)       18(3)         17(2)
  0.6      100   41(7)         42(6)         19(3)         18(3)
  0.8      100   44(5)         46(6)         23(1478)      24(50)

  s = 24, $\beta^*$ = (3, 4, ...
  0        400   229(283)      227(279)      24(0)         25(0)
  0.2      100   61(43)        67(46)        30(2)         30(2)
  0.4      100   48(2)         47(2)         31(2)         30(1)
  0.6      100   48(2)         49(2)         32(2)         32(3)
  0.8      100   49(2)         49(1)         32(2)         32(2)


Table 9

The MMMS and the associated RSD (in parentheses) of the simulated examples for
linear regressions in Setting 3 (S3), where p = 2000 and n = 600. The values with italic
font have the same meaning as Table 2. M-$\lambda_{\max}$ and its RSD have the same meaning as
Table 1.

  s     M-$\lambda_{\max}$(RSD)   SIS-MLR       SIS-MMLE      LASSO         SCAD
  3     8.47(0.17)         3(0)          3(0)          3(0)          3(0)
  6     10.36(0.26)        56(0)         56(0)         47(4)         45(3)
  12    14.69(0.39)        62(0)         62(0)         1610(10)      1304(2)
  24    23.70(0.14)        81(19)        81(23)        1637(14)      1303(1)

paves the way for sample variable screening, as long as the surrogate
signals are uniformly distinguishable from the stochastic noise. Along this
line, many statistics, such as the R-squared statistic and marginal pseudo-likelihoods
(least squares estimation, for example), can serve as a basis for
independence learning. Meanwhile, the proposed properties, sure screening and
vanishing false selection rate, will be good criteria for evaluating ultrahigh
dimensional variable selection methods.

As our current results only hold when the log-likelihood function is con- 
cave in the regression parameters, the proposed procedure does not cover all 
generalized linear models, such as some noncanonical link cases. This leaves 
space for future research. 

Unlike Fan and Lv (2008), the main idea of our technical proofs is broadly
applicable. We conjecture that our results should hold when the conditional
distribution of the outcome Y given the covariates $\mathbf{X}$ depends only on $\mathbf{X}^T\boldsymbol\beta^*$
and is arbitrary and unknown otherwise. Therefore, besides GLIM, the SIS
method can be applied to a rich class of general regression models, including
transformation models (Box and Cox, 1964; Bickel and Doksum, 1981),
censored regression models (Cox, 1972; Kosorok et al., 2004; Zeng and Lin, 2007)
and projection pursuit regression (Friedman and Stuetzle, 1981). These
are also interesting future research topics.

Another important extension is to generalize the concept of marginal
regression to marginal group regression, where the number of covariates
m in each marginal regression is greater than or equal to one. This leads to a
new procedure called grouped variables screening. It is expected to improve
the situation when the variables are highly correlated and jointly important,
but marginally the correlation between each individual variable and
the response is weak. The current theoretical studies for the componentwise
marginal regression can be directly extended to group variable screening,
with appropriate conditions and adjustments. This leads to another
interesting topic of future research.

In practice, how to choose the tuning parameter $\gamma_n$ is an interesting and
important problem. As discussed in Fan and Lv (2008), for the first stage
of the iterative SIS procedure, our preference is to select sufficiently many
features, such that $|\widehat{\mathcal{M}}_{\gamma_n}| = n$ or $n/\log(n)$. The FDR-based methods in
multiple comparison can also possibly be employed. In the second or final stage,
a Bayesian information type of criterion can be applied. In practice, some
data-driven methods may also be welcome for choosing the tuning parameter $\gamma_n$.
This is an interesting future research topic and is beyond the scope of the
current paper.
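
The first-stage rule above, which keeps $n$ or $n/\log(n)$ features, can be written as a small helper. The sketch below is an illustration only: the name `screen_top_d` and its arguments are hypothetical, and the returned threshold is simply the smallest retained absolute utility, which plays the role of $\gamma_n$ implied by the chosen model size.

```python
import math
import numpy as np

def screen_top_d(marginal_utility, n, rule="n/log(n)"):
    """Keep the d features with the largest absolute marginal utility.

    d = n when rule == "n", otherwise d = floor(n / log(n)).
    Returns the sorted indices kept and the implied threshold.
    """
    utility = np.abs(np.asarray(marginal_utility, dtype=float))
    d = n if rule == "n" else int(n / math.log(n))
    d = min(d, len(utility))                 # cannot keep more features than exist
    order = np.argsort(-utility)             # largest utility first
    kept = np.sort(order[:d])
    gamma_n = float(utility[order[d - 1]])   # smallest retained utility
    return kept, gamma_n
```

For example, with $n = 20$ the rule $n/\log(n)$ keeps 6 features, since $20/\log 20 \approx 6.68$.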

9. Proofs. To establish Theorem 1, the following symmetrization theorem
in van der Vaart and Wellner (1996), contraction theorem in Ledoux
and Talagrand (1991) and concentration theorem in Massart (2000) will be
needed. We reproduce them here for the sake of readability.

Lemma 2. (Symmetrization, Lemma 2.3.1, van der Vaart and Wellner, 1996)
Let $Z_1, \ldots, Z_n$ be independent random variables with values in $\mathcal{Z}$
and let $\mathcal{F}$ be a class of real valued functions on $\mathcal{Z}$. Then

$E\{\sup_{f \in \mathcal{F}} |(\mathbb{P}_n - P) f(Z)|\} \le 2 E\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon f(Z)|\},$

where $\varepsilon_1, \ldots, \varepsilon_n$ is a Rademacher sequence (i.e., an i.i.d. sequence taking values
$\pm 1$ with probability 1/2) independent of $Z_1, \ldots, Z_n$ and $P f(Z) = E f(Z)$.
Lemma 3. (Contraction theorem, Ledoux and Talagrand, 1991) Let
$z_1, \ldots, z_n$ be nonrandom elements of some space $\mathcal{Z}$ and let $\mathcal{F}$ be a class
of real valued functions on $\mathcal{Z}$. Let $\varepsilon_1, \ldots, \varepsilon_n$ be a Rademacher sequence.
Consider Lipschitz functions $\gamma_i : R \to R$, that is,

$|\gamma_i(s) - \gamma_i(\tilde s)| \le |s - \tilde s|, \quad \forall s, \tilde s \in R.$

Then for any function $\tilde f : \mathcal{Z} \to R$, we have

$E\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon(\gamma(f) - \gamma(\tilde f))|\} \le 2 E\{\sup_{f \in \mathcal{F}} |\mathbb{P}_n \varepsilon(f - \tilde f)|\}.$



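Lemma 4. (Concentration theorem, Massart, 2000) Let $Z_1, \ldots, Z_n$ be
independent random variables with values in some space $\mathcal{Z}$ and let $\gamma \in \Gamma$,
a class of real valued functions on $\mathcal{Z}$. We assume that for some positive
constants $l_{i,\gamma}$ and $u_{i,\gamma}$, $l_{i,\gamma} \le \gamma(Z_i) \le u_{i,\gamma}$ for all $\gamma \in \Gamma$. Define

$L^2 = \sup_{\gamma \in \Gamma} \sum_{i=1}^n (u_{i,\gamma} - l_{i,\gamma})^2 / n$ and

$Z = \sup_{\gamma \in \Gamma} |(\mathbb{P}_n - P)\gamma(Z)|,$

then for any $t > 0$,

$P(Z \ge EZ + t) \le \exp\left(-\frac{n t^2}{2 L^2}\right).$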


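Let $N > 0$, define a set of $\boldsymbol\beta$:

$\mathcal{B}(N) = \{\boldsymbol\beta \in \mathcal{B}, \|\boldsymbol\beta - \boldsymbol\beta_0\| \le N\}.$

Let

$G_1(N) = \sup_{\boldsymbol\beta \in \mathcal{B}(N)} |(\mathbb{P}_n - P)\{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\} I_n(\mathbf{X}, Y)|,$

where $I_n(\mathbf{X}, Y)$ is defined in Condition B. The next result is about the upper
bound of the tail probability for $G_1(N)$ in the neighborhood $\mathcal{B}(N)$.

Lemma 5. For all $t > 0$, it holds that

$P\{G_1(N) \ge 4N k_n (q/n)^{1/2}(1 + t)\} \le \exp(-2t^2/K_n^2).$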

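Proof of Lemma 5. The main idea is to apply the concentration theorem
(Lemma 4). To this end, we first show that the random variables involved
are bounded. By Condition B and the Cauchy-Schwarz inequality, we have
that on the set $\Omega_n$,

$|l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)| \le k_n |\mathbf{X}^T(\boldsymbol\beta - \boldsymbol\beta_0)| \le k_n \|\mathbf{X}\| \|\boldsymbol\beta - \boldsymbol\beta_0\|.$

On the set $\Omega_n$, by the definition of $\mathcal{B}(N)$, the above random variable is
further bounded by $k_n q^{1/2} K_n N$. Hence, $L^2 = 4 k_n^2 q K_n^2 N^2$, using the notation
of Lemma 4.

We need to bound the expectation $E G_1(N)$. An application of the
symmetrization theorem (Lemma 2) yields that

$E G_1(N) \le 2 E\{\sup_{\boldsymbol\beta \in \mathcal{B}(N)} |\mathbb{P}_n \varepsilon \{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\} I_n(\mathbf{X}, Y)|\}.$

By the contraction theorem (Lemma 3), and the Lipschitz condition in
Condition B, we can bound the right hand side of the above inequality further
by

(6) $4 k_n E\{\sup_{\boldsymbol\beta \in \mathcal{B}(N)} |\mathbb{P}_n \varepsilon \mathbf{X}^T(\boldsymbol\beta - \boldsymbol\beta_0) I_n(\mathbf{X}, Y)|\}.$

By the Cauchy-Schwarz inequality, the expectation in (6) is controlled by

(7) $E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\| \sup_{\boldsymbol\beta \in \mathcal{B}(N)} \|\boldsymbol\beta - \boldsymbol\beta_0\| \le E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\| N.$

By Jensen's inequality, the expectation in (7) is bounded above by

$(E\|\mathbb{P}_n \varepsilon \mathbf{X} I_n(\mathbf{X}, Y)\|^2)^{1/2} = (E\|\mathbf{X}\|^2 I_n(\mathbf{X}, Y)/n)^{1/2} \le (q/n)^{1/2},$

by noticing that

$E\|\mathbf{X}\|^2 I_n(\mathbf{X}, Y) \le E\|\mathbf{X}\|^2 = E(X_1^2 + \cdots + X_q^2) = q,$

since $E X_j^2 = 1$. Combining these results, we conclude that

$E G_1(N) \le 4N k_n (q/n)^{1/2}.$

An application of the concentration theorem (Lemma 4), with $4N k_n (q/n)^{1/2} t$ in place of $t$ and $2L^2 = 8 q K_n^2 k_n^2 N^2$, yields that

$P\{G_1(N) \ge 4N k_n (q/n)^{1/2}(1 + t)\} \le \exp\left(-\frac{n\{4N k_n (q/n)^{1/2} t\}^2}{8 q K_n^2 k_n^2 N^2}\right) = \exp(-2t^2/K_n^2).$

This proves the lemma. □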



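Proof of Theorem 1. The proof takes two main steps: we first bound
$\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|$ by $G(N)$ for a small $N$, where $N$ is chosen so that Conditions B and
C hold, and then utilize Lemma 5 to conclude.

Following a similar idea in van de Geer (2002), we define a convex
combination $\boldsymbol\beta_s = s\hat{\boldsymbol\beta} + (1 - s)\boldsymbol\beta_0$ with

$s = (1 + \|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|/N)^{-1}.$

Then, by definition,

$\|\boldsymbol\beta_s - \boldsymbol\beta_0\| = s\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \le N,$

namely, $\boldsymbol\beta_s \in \mathcal{B}(N)$. Due to the convexity, we have

(8) $\mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_s, Y) \le s \mathbb{P}_n l(\mathbf{X}^T\hat{\boldsymbol\beta}, Y) + (1 - s)\mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_0, Y) \le \mathbb{P}_n l(\mathbf{X}^T\boldsymbol\beta_0, Y).$

Since $\boldsymbol\beta_0$ is the minimizer, we have

$E[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \ge 0,$

where $\boldsymbol\beta_s$ is regarded as a parameter in the above expectation. Hence, it follows
from (8) that

$E[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \le (E - \mathbb{P}_n)[l(\mathbf{X}^T\boldsymbol\beta_s, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)] \le G(N),$

where

$G(N) = \sup_{\boldsymbol\beta \in \mathcal{B}(N)} |(\mathbb{P}_n - P)\{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\}|.$

By Condition C, it follows that

(9) $\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \le [G(N)/V_n]^{1/2}.$

We now use (9) to conclude the result. Note that for any $x$,

$P(\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge x) \le P\{G(N) \ge V_n x^2\}.$

Setting $x = N/2$, we have

$P(\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge N/2) \le P\{G(N) \ge V_n N^2/4\}.$

Using the definition of $\boldsymbol\beta_s$, the left hand side is the same as $P\{\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \ge N\}$.
Now, by taking $N = 4a_n(1 + t)/V_n$ with $a_n = 4k_n\sqrt{q/n}$, we have

$P\{\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \ge N\} \le P\{G(N) \ge V_n N^2/4\} = P\{G(N) \ge N a_n(1 + t)\}.$

The last probability is bounded by

(10) $P\{G(N) \ge N a_n(1 + t), \Omega_{n,*}\} + P(\Omega_{n,*}^c),$

where $\Omega_{n,*} = \{\|\mathbf{X}_i\|_\infty \le K_n, |Y_i| \le K_n^*, i = 1, \ldots, n\}$.
On the set $\Omega_{n,*}$, since

$\sup_{\boldsymbol\beta \in \mathcal{B}(N)} |\mathbb{P}_n \{l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)\}(1 - I_n(\mathbf{X}, Y))| = 0,$

by the triangle inequality,

$G(N) \le G_1(N) + \sup_{\boldsymbol\beta \in \mathcal{B}(N)} |E[l(\mathbf{X}^T\boldsymbol\beta, Y) - l(\mathbf{X}^T\boldsymbol\beta_0, Y)](1 - I_n(\mathbf{X}, Y))|.$

It follows from Condition B that (10) is bounded by

$P\{G_1(N) \ge N a_n(1 + t) + o(q/n)\} + n P\{(\mathbf{X}, Y) \in \Omega_n^c\}.$

The conclusion follows from Lemma 5. □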
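
The convex-combination device used in this proof is easy to check numerically. The sketch below is only an illustration with a hypothetical helper name `shrink_towards`: it forms $\boldsymbol\beta_s$ from an arbitrary estimate and confirms that $\boldsymbol\beta_s$ always lies within distance $N$ of $\boldsymbol\beta_0$, and that $\|\boldsymbol\beta_s - \boldsymbol\beta_0\| \ge N/2$ exactly when $\|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\| \ge N$.

```python
import numpy as np

def shrink_towards(beta_hat, beta0, N):
    """Return beta_s = s * beta_hat + (1 - s) * beta0 with s = 1 / (1 + ||beta_hat - beta0|| / N)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    d = np.linalg.norm(beta_hat - beta0)
    s = 1.0 / (1.0 + d / N)
    return s * beta_hat + (1.0 - s) * beta0

beta0 = np.zeros(2)
for d in (0.2, 3.0, 100.0):                     # distances of beta_hat from beta0
    beta_s = shrink_towards([d, 0.0], beta0, N=1.0)
    dist = np.linalg.norm(beta_s - beta0)       # equals d / (1 + d) when N = 1
    print(d, dist, dist >= 0.5, d >= 1.0)       # the last two flags always agree
```

Because $\|\boldsymbol\beta_s - \boldsymbol\beta_0\| = Nd/(N + d)$ with $d = \|\hat{\boldsymbol\beta} - \boldsymbol\beta_0\|$, the map $d \mapsto Nd/(N+d)$ is increasing and equals $N/2$ at $d = N$, which is the equivalence used in the proof.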



Proof of Theorem 2. First of all, the target function $E\,l(\beta_0 + \beta_j X_j, Y)$ is a
convex function in $\beta_j$. We first show that if $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j) = 0$, then $\beta_j^M$
must be zero. Recall $E X_j = 0$. The score equation of the marginal regression
at $\boldsymbol\beta_j^M$ takes the form:

$E\{b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) X_j\} = E(Y X_j) = E\{b'(\mathbf{X}^T\boldsymbol\beta^*) X_j\}.$

It can be equivalently written as

$\mathrm{cov}(b'(\mathbf{X}_j^T\boldsymbol\beta_j^M), X_j) = \mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j) = 0.$

Since both functions $f(t) = b'(\beta_{j,0}^M + t)$ and $h(t) = t$ are strictly monotone
in $t$, when $t \ne 0$,

$\{f(t) - f(0)\}(t - 0) > 0.$

If $\beta_j^M \ne 0$, let $t = \beta_j^M X_j$,

$\beta_j^M \mathrm{cov}(f(\beta_j^M X_j), X_j) = E[E\{f(t) - f(0)\}(t - 0) \mid X_j \ne 0] > 0,$

which leads to a contradiction. Hence $\beta_j^M$ must be zero.

On the other side, if $\beta_j^M = 0$, the score equations now take the form:

(11) $E\{b'(\beta_{j,0}^M)\} = E\{b'(\mathbf{X}^T\boldsymbol\beta^*)\}$, and

(12) $E\{b'(\beta_{j,0}^M) X_j\} = E\{b'(\mathbf{X}^T\boldsymbol\beta^*) X_j\}.$

Since $b'(\beta_{j,0}^M)$ is a constant, we can get the desired result by plugging (11)
into (12). □



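Proof of Theorem 3. We first prove the case that $b''(\theta)$ is bounded. By the
Lipschitz continuity of the function $b'(\cdot)$, we have

$|\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\} X_j| \le D_1 |\beta_j^M| X_j^2,$

where $D_1 = \sup_x b''(x)$. By taking the expectation on both sides, we have

$|E\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\} X_j| \le D_1 |\beta_j^M|,$

namely,

(13) $D_1 |\beta_j^M| \ge |\mathrm{cov}(b'(\beta_{j,0}^M + X_j\beta_j^M), X_j)|.$

Note that $\beta_{j,0}^M$ and $\beta_j^M$ satisfy the score equation

(14) $E\{b'(\beta_{j,0}^M + \beta_j^M X_j) - b'(\mathbf{X}^T\boldsymbol\beta^*)\} X_j = 0.$

It follows from (13), the score equation (14) and $E X_j = 0$ that

$|\beta_j^M| \ge D_1^{-1} c_1 n^{-\kappa}.$

The conclusion follows.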



We now prove the second case. The result holds trivially if $|\beta_j^M| \ge c n^{-\kappa}$ for
a sufficiently large universal constant $c$. Now suppose that $|\beta_j^M| \le c_9 n^{-\kappa}$, for
some positive constant $c_9$. We will show later that $|\beta_{j,0}^M - \beta_0^M| \le c_{10}$ for some
$c_{10} > 0$, where $\beta_0^M$ is such that $b'(\beta_0^M) = EY$. In this case, if $|X_j| \le n^\eta$, then
the points $\beta_{j,0}^M$ and $(\beta_{j,0}^M + X_j\beta_j^M)$ fall in the interval $\beta_0^M \pm h$, independent
of $j$, where $h = c_9 + c_{10}$.

By the Lipschitz continuity of the function $b'(\cdot)$ in the neighborhood
around $\beta_0^M$, we have for $|X_j| \le n^\eta$,

$|\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\} X_j| \le D_2 |\beta_j^M| X_j^2,$

where $D_2 = \max_{x \in [\beta_0^M - h, \beta_0^M + h]} b''(x)$. By taking the expectation on both
sides, it follows that

(15) $|E\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\} X_j I(|X_j| \le n^\eta)| \le D_2 |\beta_j^M|.$

By using (14) and $E X_j = 0$, we deduce from (15) that

(16) $D_2 |\beta_j^M| \ge |\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j)| - A_0 - A_1,$

where $A_m = E|b'(\beta_{j,0}^M + X_j^m\beta_j^M) X_j| I(|X_j| \ge n^\eta)$ for $m = 0$ and 1. Since
$|\beta_{j,0}^M + X_j\beta_j^M| \le a|X_j|$ for $|X_j| \ge n^\eta$ for a sufficiently large $n$, independent
of $j$, by the condition given in the theorem, we have

$A_m \le E\,G(a|X_j|)|X_j| I(|X_j| \ge n^\eta) \le d n^{-\kappa}$, for $m = 0$ and 1.

The conclusion now follows from (16).

It remains to show that when $|\beta_j^M| \le c_9 n^{-\kappa}$ we have $|\beta_{j,0}^M - \beta_0^M| \le c_{10}$.
To this end, let

$\ell(\beta_0) = E\{b(\beta_0 + \beta_j^M X_j) - Y(\beta_0 + \beta_j^M X_j)\}.$

Then, it is easy to see that

$\ell'(\beta_0) = E b'(\beta_0 + \beta_j^M X_j) - b'(\beta_0^M).$

Observe that

(17) $|E b'(\beta_0 + \beta_j^M X_j) - b'(\beta_0)| \le R_1 + R_2,$

where $R_1 = \sup_{|x| \le c_9 n^{\eta - \kappa}} |b'(\beta_0 + x) - b'(\beta_0)|$ and $R_2 = 2E\,G(a|X_j|) I(|X_j| > n^\eta)$.
Now, $R_1 = o(1)$ due to the continuity of $b'(\cdot)$ and $R_2 = o(1)$ by the
condition of the theorem. Consequently, by (17), we conclude that

$\ell'(\beta_0) = b'(\beta_0) - b'(\beta_0^M) + o(1).$

Since $b'(\cdot)$ is a strictly increasing function, it is now obvious that

$\ell'(\beta_0^M - c_{10}) < 0, \quad \ell'(\beta_0^M + c_{10}) > 0,$

for any given $c_{10} > 0$. Hence, $|\beta_{j,0}^M - \beta_0^M| < c_{10}$. □

Proof of Proposition 1. Without loss of generality, assume that $f(\cdot)$ is
strictly increasing and $\rho > 0$. Since $X$ and $Z$ are jointly normally distributed,
$Z$ can be expressed as

$Z = \rho X + \varepsilon,$

where $\rho = E(XZ)$ is the regression coefficient and $X$ and $\varepsilon$ are independent.
Thus,

(18) $E f(Z) X = E g(\rho X) X = E[g(\rho X) - g(0)] X,$

where $g(x) = E f(x + \varepsilon)$ is a strictly increasing function. The right hand side
of (18) is always nonnegative and is zero if and only if $\rho = 0$.

To prove the second part, we first note that the random variable on the
right hand side of (18) is nonnegative. Thus, by the mean-value theorem, we
have that

$|E f(Z) X| \ge \inf_{|x| \le c\rho} |g'(x)| \, \rho \, E X^2 I(|X| \le c).$

Hence, the result follows. □
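
The first claim of the proposition can be checked numerically. The sketch below is an illustration under assumptions that are not in the text: $X$ and $Z$ are taken to be standard normal with correlation $\rho$, so that $\varepsilon \sim N(0, 1 - \rho^2)$, and the expectation $E f(Z)X$ is approximated by a product Gauss-Hermite rule via `numpy.polynomial.hermite_e.hermegauss`. The function name `cov_f_of_z_with_x` is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def cov_f_of_z_with_x(f, rho, deg=60):
    """Approximate E[f(Z) X] for standard normal X, Z with correlation rho."""
    nodes, weights = hermegauss(deg)          # rule for the weight exp(-x^2 / 2)
    w = weights / np.sqrt(2.0 * np.pi)        # normalize to N(0, 1) probabilities
    x = nodes[:, None]                        # grid for X
    e = np.sqrt(1.0 - rho**2) * nodes[None, :]  # grid for the independent error
    return float(np.sum(w[:, None] * w[None, :] * f(rho * x + e) * x))

for rho in (-0.5, 0.0, 0.5):
    print(rho, cov_f_of_z_with_x(np.tanh, rho))
```

For the strictly increasing choice $f = \tanh$, the printed value has the sign of $\rho$ and is numerically zero at $\rho = 0$, in line with (18); for the identity function the quadrature returns $\rho$ itself.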

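Proof of Lemma 1. By Chebyshev's inequality,

$P(Y \ge u) \le \exp(-s_0 u) E \exp(s_0 Y).$

Let $\theta = \mathbf{X}^T\boldsymbol\beta^*$. Since $Y$ belongs to an exponential family, we have

$E\{\exp(s_0 Y) \mid \theta\} = \exp(b(\theta + s_0) - b(\theta)).$

Hence

$P(Y \ge u) \le \exp(-s_0 u) E \exp(b(\mathbf{X}^T\boldsymbol\beta^* + s_0) - b(\mathbf{X}^T\boldsymbol\beta^*)).$

Similarly we can get

$P(Y \le -u) \le \exp(-s_0 u) E \exp(b(\mathbf{X}^T\boldsymbol\beta^* - s_0) - b(\mathbf{X}^T\boldsymbol\beta^*)).$

The desired result thus follows from Condition D by letting $u = m_0 t^\alpha / s_0$. □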
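
As a concrete instance of the moment generating function identity used above, consider the Poisson family, for which $b(\theta) = e^\theta$. The sketch below is an illustration only, with hypothetical function names and a fixed canonical parameter rather than a random $\mathbf{X}^T\boldsymbol\beta^*$: it compares $\exp(b(\theta + s) - b(\theta))$ with a direct series evaluation of $E\exp(sY)$, and evaluates the exact upper tail that the exponential bound controls.

```python
import math

def poisson_mgf_via_b(theta, s):
    """exp(b(theta + s) - b(theta)) with the Poisson cumulant function b(theta) = exp(theta)."""
    return math.exp(math.exp(theta + s) - math.exp(theta))

def poisson_mgf_direct(theta, s, terms=200):
    """E exp(sY) for Y ~ Poisson(exp(theta)), summed term by term in log scale."""
    lam = math.exp(theta)
    return sum(math.exp(s * y - lam + y * math.log(lam) - math.lgamma(y + 1))
               for y in range(terms))

def poisson_upper_tail(theta, u, terms=200):
    """P(Y >= u) for Y ~ Poisson(exp(theta)), truncated after `terms` summands."""
    lam = math.exp(theta)
    return sum(math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
               for y in range(math.ceil(u), terms))

theta, s0, u = 0.5, 0.3, 5
bound = math.exp(-s0 * u) * poisson_mgf_via_b(theta, s0)
print(poisson_upper_tail(theta, u), "<=", bound)
```

The two evaluations of the moment generating function agree up to rounding, and the exact tail probability sits below the exponential bound, as the first display of the proof requires.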

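Proof of Theorem 4. Note that Condition B is satisfied with $k_n$ defined in
Section 5.2. The tail part of Condition B can also be easily checked. In fact,

$|E[l(\mathbf{X}_j^T\boldsymbol\beta_j, Y) - l(\mathbf{X}_j^T\boldsymbol\beta_j^M, Y)](1 - I_n(\mathbf{X}_j, Y))|$

$\le |E b(\mathbf{X}_j^T\boldsymbol\beta_j) I(|X_j| \ge K_n)| + |E b(\mathbf{X}_j^T\boldsymbol\beta_j^M) I(|X_j| \ge K_n)| + B(\boldsymbol\beta_j) + B(\boldsymbol\beta_j^M),$

where $B(\boldsymbol\beta_j) = |E Y \mathbf{X}_j^T\boldsymbol\beta_j (1 - I_n(\mathbf{X}_j, Y))|$. The first two terms are of order
$o(1/n)$ by assumption and the last two terms can be bounded by the
exponential tail conditions in Condition D and the Cauchy-Schwarz inequality.
By Theorem 1, we have for any $t > 0$,

$P(\sqrt{n}\,|\hat\beta_j^M - \beta_j^M| \ge 16(1 + t)k_n/V) \le \exp(-2t^2/K_n^2) + n m_1 \exp(-m_0 K_n^\alpha).$

By taking $1 + t = c_3 V n^{1/2 - \kappa}/(16 k_n)$, it follows that

$P(|\hat\beta_j^M - \beta_j^M| \ge c_3 n^{-\kappa}) \le \exp(-c_4 n^{1 - 2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^\alpha).$

The first result follows from the union bound of probability.
To prove the second part, note that on the event

$A_n = \{\max_{j \in \mathcal{M}_\star} |\hat\beta_j^M - \beta_j^M| \le c_2 n^{-\kappa}/2\},$

by Theorem 3, we have

(19) $|\hat\beta_j^M| \ge c_2 n^{-\kappa}/2$, for all $j \in \mathcal{M}_\star$.

Hence, by the choice of $\gamma_n$, we have $\mathcal{M}_\star \subset \widehat{\mathcal{M}}_{\gamma_n}$. The result now follows
from a simple union bound:

$P(A_n^c) \le s_n\{\exp(-c_4 n^{1 - 2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^\alpha)\}.$

This completes the proof. □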

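Proof of Theorem 5. The key idea of the proof is to show that

(20) $\|\boldsymbol\beta^M\|^2 = O(\|\Sigma\boldsymbol\beta^*\|^2) = O\{\lambda_{\max}(\Sigma)\}.$

If so, the number of $\{j : |\beta_j^M| > \epsilon n^{-\kappa}\}$ cannot exceed $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$ for
any $\epsilon > 0$. Thus, on the set

$B_n = \{\max_{1 \le j \le p_n} |\hat\beta_j^M - \beta_j^M| \le \epsilon n^{-\kappa}\},$

the number of $\{j : |\hat\beta_j^M| > 2\epsilon n^{-\kappa}\}$ cannot exceed the number of $\{j : |\beta_j^M| > \epsilon n^{-\kappa}\}$,
which is bounded by $O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}$. By taking $\epsilon = c_5/2$, we have

$P[|\widehat{\mathcal{M}}_{\gamma_n}| \le O\{n^{2\kappa}\lambda_{\max}(\Sigma)\}] \ge P(B_n).$

The conclusion follows from Theorem 4(i).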

imsart-aos ver. 2007/09/18 file: SISGLIM-revise-v4.tex date: January 18, 2010 






It remains to prove (20). We first bound $\beta_j^M$. Since $b'(\cdot)$ is monotonically
increasing, the function

$\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\} X_j\beta_j^M$

is always nonnegative. By Taylor's expansion, we have

$\{b'(\beta_{j,0}^M + X_j\beta_j^M) - b'(\beta_{j,0}^M)\}\beta_j^M X_j \ge D_3 (\beta_j^M X_j)^2 I(|X_j| \le K),$

where $D_3 = \inf_{|x| \le K(B+1)} b''(x)$, since $(\beta_{j,0}^M, \beta_j^M)$ is an interior point of the
square $\mathcal{B}$ with length $2B$. By taking the expectation on both sides and using
$E X_j = 0$, we have

$E b'(\beta_{j,0}^M + X_j\beta_j^M)\beta_j^M X_j \ge D_3 E(\beta_j^M X_j)^2 I(|X_j| \le K).$

Since $E X_j^2 I(|X_j| \le K) = 1 - E X_j^2 I(|X_j| > K)$, it is uniformly bounded
from below for a sufficiently large $K$, due to the uniform exponential tail
bound in Condition D. Thus, it follows from (12) that

(21) $|\beta_j^M| \le D_4 |E b'(\mathbf{X}^T\boldsymbol\beta^*) X_j|,$

for some $D_4 > 0$.

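We now further bound from above the right hand side of (21) by using
$\mathrm{var}(\mathbf{X}^T\boldsymbol\beta^*) = O(1)$. We first show the case where $b''(\cdot)$ is bounded. By the
Lipschitz continuity of the function $b'(\cdot)$, we have

$|\{b'(\mathbf{X}^T\boldsymbol\beta^*) - b'(\beta_0^*)\} X_j| \le D_1 |X_j \mathbf{X}_M^T\boldsymbol\beta_1^*|,$

where $\mathbf{X}_M = (X_1, \ldots, X_{p_n})^T$ and $\boldsymbol\beta_1^* = (\beta_1^*, \ldots, \beta_{p_n}^*)^T$.

By putting the above equation into the vector form and taking the
expectation on both sides, we have

(22) $\|E\{b'(\mathbf{X}^T\boldsymbol\beta^*) - b'(\beta_0^*)\}\mathbf{X}_M\| \le D_1 \|E\mathbf{X}_M\mathbf{X}_M^T\boldsymbol\beta_1^*\|.$

Using $E\mathbf{X}_M = 0$ and $\mathrm{var}(\mathbf{X}^T\boldsymbol\beta^*) = O(1)$, we conclude that

$\|E b'(\mathbf{X}^T\boldsymbol\beta^*)\mathbf{X}_M\|^2 \le D_1^2 \|\Sigma\boldsymbol\beta_1^*\|^2 \le D_6 \lambda_{\max}(\Sigma),$

for some positive constant $D_6$. This together with (21) entails (20).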

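It remains to bound (21) for the second case. Since $\mathbf{X}_M = \Sigma_1^{1/2} R\,\mathbf{U}$, it
follows that

$\mathbf{X}_M^T\boldsymbol\beta_1^* = R\,\boldsymbol\beta_2^T\mathbf{U},$

where $\boldsymbol\beta_2 = \Sigma_1^{1/2}\boldsymbol\beta_1^*$. By conditioning on $\boldsymbol\beta_2^T\mathbf{U}$, it can be computed that

$E(\mathbf{U} \mid \boldsymbol\beta_2^T\mathbf{U}) = \boldsymbol\beta_2^T\mathbf{U}\,\boldsymbol\beta_2/\|\boldsymbol\beta_2\|^2.$

Therefore,

$E b'(\beta_0^* + \mathbf{X}_M^T\boldsymbol\beta_1^*)\mathbf{X}_M = E b'(\beta_0^* + R\,\boldsymbol\beta_2^T\mathbf{U})\, R\, \Sigma_1^{1/2}\boldsymbol\beta_2^T\mathbf{U}\,\boldsymbol\beta_2/\|\boldsymbol\beta_2\|^2$

$= E b'(\mathbf{X}^T\boldsymbol\beta^*)(\mathbf{X}^T\boldsymbol\beta^* - \beta_0^*)\,\Sigma_1^{1/2}\boldsymbol\beta_2/\|\boldsymbol\beta_2\|^2.$

This entails that

(23) $\|E b'(\mathbf{X}^T\boldsymbol\beta^*)\mathbf{X}_M\|^2 = |E b'(\mathbf{X}^T\boldsymbol\beta^*)(\mathbf{X}^T\boldsymbol\beta^* - \beta_0^*)|^2\, \|\Sigma_1^{1/2}\boldsymbol\beta_2\|^2/\|\boldsymbol\beta_2\|^4.$

By Condition G, $|E b'(\mathbf{X}^T\boldsymbol\beta^*)(\mathbf{X}^T\boldsymbol\beta^* - \beta_0^*)| = O(1)$. We also observe the facts
that $\|\Sigma_1^{1/2}\boldsymbol\beta_2\| \le \lambda_{\max}^{1/2}(\Sigma)\|\boldsymbol\beta_2\|$ and that $\|\boldsymbol\beta_2\| = \|\Sigma_1^{1/2}\boldsymbol\beta_1^*\|$ is bounded. This
proves (20) for the second case by using (21) and completes the proof. □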



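Proof of Theorem 6. If $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j) = 0$, by Theorem 2, we have
$\beta_j^M = 0$, hence by the model identifiability at $\boldsymbol\beta_0^M$, $\beta_{j,0}^M = \beta_0^M$. Hence, $L_j^* = 0$.
On the other hand, if $L_j^* = 0$, by Condition C'', it follows that $\boldsymbol\beta_j^M = \boldsymbol\beta_0^M$,
that is, $\beta_{j,0}^M = \beta_0^M$ and $\beta_j^M = 0$. Hence by Theorem 2, $\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j) = 0$.
□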
Proof of Theorem 7. If $|\mathrm{cov}(b'(\mathbf{X}^T\boldsymbol\beta^*), X_j)| \ge c_1 n^{-\kappa}$, for $j \in \mathcal{M}_\star$, by
Theorem 3, we have $\min_{j \in \mathcal{M}_\star} |\beta_j^M| \ge c_2 n^{-\kappa}$. The first result thus follows
from Condition C.

To prove the second result, we will bound $\mathbf{L}^*$. We first show the case
where $b''(\cdot)$ is bounded. By definition, we have

(24) $0 \le L_j^* \le E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol\beta_j^M, Y)\}.$

By Taylor's expansion of the right hand side of (24), we have that

(25) $E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol\beta_j^M, Y)\} \le D_5(\beta_j^M)^2$, for some $D_5 > 0$.

The desired result thus follows from (24), (25) and the proof in Theorem 5
that

$\|\mathbf{L}^*\| \le O(\|\boldsymbol\beta^M\|^2) = O(\lambda_{\max}(\Sigma)).$
Now we prove the second case. By the mean-value theorem,

(26) $E\{l(\beta_{j,0}^M, Y) - l(\mathbf{X}_j^T\boldsymbol\beta_j^M, Y)\} = E\{Y - b'(\beta_{j,0}^M + sX_j\beta_j^M)\} X_j\beta_j^M,$

for some $0 < s < 1$. Since $E Y X_j = E b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) X_j$, the last term is equal to

(27) $E\{b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) - b'(\beta_{j,0}^M + sX_j\beta_j^M)\} X_j\beta_j^M.$

By the monotonicity of $b'(\cdot)$, when $X_j\beta_j^M \ge 0$, both factors in (27) are
nonnegative, and hence

(28) $\{b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) - b'(\beta_{j,0}^M + sX_j\beta_j^M)\} X_j\beta_j^M \le \{b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) - b'(\beta_{j,0}^M)\} X_j\beta_j^M.$

When $X_j\beta_j^M < 0$, both factors in (28) are negative and (28) continues to
hold. It follows from (26)-(28) and $E X_j = 0$ that the right hand side of (26) is
bounded by

(29) $E b'(\mathbf{X}_j^T\boldsymbol\beta_j^M) X_j\beta_j^M = E b'(\mathbf{X}^T\boldsymbol\beta^*) X_j\beta_j^M.$

Combining (24), (26) and (29), we can bound $\|\mathbf{L}^*\|$ in the vector form by the
Cauchy-Schwarz inequality:

$\|\mathbf{L}^*\| \le \|E b'(\mathbf{X}^T\boldsymbol\beta^*)\mathbf{X}_M\|\,\|\boldsymbol\beta^M\| = O(\|\Sigma\boldsymbol\beta^*\|\,\|\boldsymbol\beta^M\|),$

where (23) is used in the last equality. The desired result thus follows from
Theorem 5. □



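Proof of Theorem 8. To prove the result, we first bound $L_{j,n}$ from below
to show the strength of the signals. Let $\hat{\boldsymbol\beta}_0^M = (\hat\beta_0^M, 0)^T$. Then, by Taylor's
expansion, we have

(30) $2L_{j,n} = (\hat{\boldsymbol\beta}_0^M - \hat{\boldsymbol\beta}_j^M)^T \ell_j''(\boldsymbol\xi_n)(\hat{\boldsymbol\beta}_0^M - \hat{\boldsymbol\beta}_j^M) \ge \lambda_{j,\min}(\hat\beta_j^M)^2,$

where $\lambda_{j,\min}$ is the minimum eigenvalue of the Hessian matrix

$\ell_j''(\boldsymbol\xi_n) = \mathbb{P}_n b''(\boldsymbol\xi_n^T\mathbf{X}_j)\mathbf{X}_j\mathbf{X}_j^T,$

where $\boldsymbol\xi_n$ lies between $\hat{\boldsymbol\beta}_0^M$ and $\hat{\boldsymbol\beta}_j^M$. We will show

(31) $P\{\lambda_{j,\min} \ge c_{11}\} = 1 - O\{\exp(-c_{12} n^{1 - \kappa})\},$

for some $c_{11} > 0$ and $c_{12} > 0$.

Suppose (31) holds. Then, by (19), we have

$P\{\min_{j \in \mathcal{M}_\star} |\hat\beta_j^M| \ge c_2 n^{-\kappa}/2\} = 1 - O(s_n\{\exp(-c_4 n^{1 - 2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^\alpha)\}).$

This, together with (30) and (31), implies

$P\{\min_{j \in \mathcal{M}_\star} L_{j,n} \ge c_{11} c_2^2 n^{-2\kappa}/8\} = 1 - O(s_n\{\exp(-c_4 n^{1 - 2\kappa}/(k_n K_n)^2) + n m_1 \exp(-m_0 K_n^\alpha)\}).$

Hence, by choosing the threshold $\nu_n = c_7 n^{-2\kappa}$, for $c_7 < c_{11} c_2^2/8$, the set $\mathcal{M}_\star$
is contained in the selected set $\{j : L_{j,n} \ge \nu_n\}$ with probability tending to one
exponentially fast, and the result follows.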

We now prove (31). It is obvious that

$\ell_j''(\boldsymbol\xi_n) \ge \min_{|x| \le (B+1)K} b''(x)\, \mathbb{P}_n\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K),$

for any given $K$. Since the random variable involved is uniformly bounded
in $j$, it follows from the Hoeffding inequality (Hoeffding, 1963) that

(32) $P\{|(\mathbb{P}_n - P)X_j^k I(|X_j| \le K)| > \epsilon\} \le \exp(-2n\epsilon^2/(4K^{2k})),$

for any $k \ge 0$ and $\epsilon > 0$. By taking $\epsilon = n^{-\kappa/2}$, we have

$P\{|(\mathbb{P}_n - P)X_j^k I(|X_j| \le K)| > n^{-\kappa/2}\} \le \exp(-2n^{1 - \kappa}/(4K^{2k})).$

Consequently, with probability tending to one exponentially fast, we have

(33) $\ell_j''(\boldsymbol\xi_n) \ge \min_{|x| \le (B+1)K} b''(x)\, E\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K)/2.$

The minimum eigenvalue of the matrix $E\mathbf{X}_j\mathbf{X}_j^T I(|X_j| \le K)$ is

$\min_{|a| \le 1} E(a^2 + 2a\sqrt{1 - a^2}\,X_j + (1 - a^2)X_j^2) I(|X_j| \le K).$

It is bounded from below by

(34) $\min_{|a| \le 1} E\{a^2 + (1 - a^2)X_j^2 I(|X_j| \le K)\} - 2|E X_j I(|X_j| \le K)| - K^{-2},$

where we used $P(|X_j| > K) \le K^{-2}$. Since $E X_j = 0$ and $E X_j^2 = 1$,

$|E X_j I(|X_j| \le K)| = |E X_j I(|X_j| > K)| \le K^{-1} E X_j^2 I(|X_j| > K) \le K^{-1}.$

Hence the quantity in (34) can be further bounded from below by

$E X_j^2 I(|X_j| \le K) + \min_{|a| \le 1} a^2 E X_j^2 I(|X_j| > K) - 2K^{-1} - K^{-2}$

$\ge 1 - \sup_j E X_j^2 I(|X_j| > K) - 2K^{-1} - K^{-2}.$

The result follows from Condition G and (33).
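
The Hoeffding step behind (32) can be illustrated by a small simulation. The sketch below is an assumption-laden example rather than part of the proof: the covariate is taken to be standard normal, the function names are hypothetical, the centering uses the Monte Carlo mean of the truncated variable in place of its exact expectation, and the bound is written in the standard two-sided form, which carries a leading factor 2.

```python
import numpy as np

def hoeffding_bound(n, eps, K, k=1):
    """Two-sided Hoeffding bound for the mean of n i.i.d. copies of X^k I(|X| <= K).

    The truncated variable lies in [-K^k, K^k], so its range is 2 K^k.
    """
    return 2.0 * np.exp(-2.0 * n * eps**2 / (4.0 * K ** (2 * k)))

def empirical_tail(n, eps, K, k=1, reps=2000, seed=0):
    """Monte Carlo frequency of |sample mean - mean| > eps for standard normal X."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((reps, n))
    T = X**k * (np.abs(X) <= K)          # truncated variable X^k I(|X| <= K)
    centre = T.mean()                    # Monte Carlo stand-in for its expectation
    return float(np.mean(np.abs(T.mean(axis=1) - centre) > eps))

print(empirical_tail(200, 0.5, 3.0), "<=", hoeffding_bound(200, 0.5, 3.0))
```

The empirical frequency is far below the bound in this example, which reflects the fact that Hoeffding's inequality uses only the range of the truncated variable and not its variance.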



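Proof of Theorem 9. By (30), it can be easily seen that

$2L_{j,n} \le D_1 \lambda_{\max}(\mathbb{P}_n\mathbf{X}_j\mathbf{X}_j^T)\,\|\hat{\boldsymbol\beta}_0^M - \hat{\boldsymbol\beta}_j^M\|^2,$

where $D_1 = \sup_x b''(x)$ as defined in the proof of Theorem 3. By (32), with
the exception of a set with negligible probability, it follows that

$\lambda_{\max}(\mathbb{P}_n\mathbf{X}_j\mathbf{X}_j^T) \le 2\lambda_{\max}(E\mathbf{X}_j\mathbf{X}_j^T) = 2,$

uniformly in $j$. Therefore, with probability tending to one exponentially fast,
we have

(35) $L_{j,n} \le D_7\{(\hat\beta_0^M - \hat\beta_{j,0}^M)^2 + (\hat\beta_j^M)^2\},$

for some $D_7 > 0$.

We now use (35) to show that if $L_{j,n} \ge c_7 n^{-2\kappa}$, then $|\hat\beta_j^M| \ge D_8 n^{-\kappa}$, with the
exception of a set with negligible probability, where $D_8 = \{c_7/(2D_7)\}^{1/2}$.
This implies that

$\{j : L_{j,n} \ge c_7 n^{-2\kappa}\} \subset \{j : |\hat\beta_j^M| \ge \gamma_n\},$

with $\gamma_n = D_8 n^{-\kappa}$. The conclusion then follows from Theorem 5.

We now show that $L_{j,n} \ge c_7 n^{-2\kappa}$ implies that $|\hat\beta_j^M| \ge D_8 n^{-\kappa}$, with the
exception of a set with negligible probability. Suppose that $|\hat\beta_j^M| < D_8 n^{-\kappa}$.
From the likelihood equations, we have

(36) $b'(\hat\beta_0^M) = \bar{Y} = \mathbb{P}_n b'(\hat\beta_{j,0}^M + \hat\beta_j^M X_j).$

From the proof of Theorem 4, with the exception of a set with negligible
probability, we have $|\hat\beta_0^M - \beta_0^M| \le c_{13} n^{-\kappa}$ and $|\hat\beta_{j,0}^M - \beta_{j,0}^M| \le c_{14} n^{-\kappa}$, for some
constants $c_{13}$ and $c_{14}$. Since $(\beta_0^M, 0)$ and $(\beta_{j,0}^M, \beta_j^M)$ are interior points of
the square $\mathcal{B}$ with length $2B$, it follows that, with the exception of a set with
negligible probability, $|\hat\beta_0^M| \le B$ and $|\hat\beta_{j,0}^M| \le B$. Recall $D_1 = \sup_x b''(x)$. By
Taylor expansion, for some $0 < s < 1$, we have

$|\mathbb{P}_n b'(\hat\beta_{j,0}^M + \hat\beta_j^M X_j) - b'(\hat\beta_{j,0}^M)| = |b''(\hat\beta_{j,0}^M + s\hat\beta_j^M X_j)\,\hat\beta_j^M\,\mathbb{P}_n X_j|$

(37) $\le D_1 |\hat\beta_j^M\,\mathbb{P}_n X_j| = o_P(|\hat\beta_j^M|),$

where the last step follows from the facts that $E X_j = 0$ and consequently
$\mathbb{P}_n X_j = o(1)$ with the exception of a set of negligible probability, by applying
the Hoeffding inequality to the truncated variable and considering the exponential
tail property of $X_j$. Hence, by (36) and (37), we have

$|b'(\hat\beta_0^M) - b'(\hat\beta_{j,0}^M)| = o_P(|\hat\beta_j^M|).$

Let $D_9 = \inf_{|x| \le 2B} b''(x)$. With the exception of a set with negligible probability,
we have

$|b'(\hat\beta_0^M) - b'(\hat\beta_{j,0}^M)| \ge D_9 |\hat\beta_0^M - \hat\beta_{j,0}^M|.$

Therefore, we conclude that

$|\hat\beta_0^M - \hat\beta_{j,0}^M| = o_P(|\hat\beta_j^M|).$

By (35), we have $|\hat\beta_j^M| \ge D_8 n^{-\kappa}$. This is a contradiction, except on a set
that has a negligible probability. This completes the proof. □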
Fig 1. The boxplots of the minimum |t|-statistics in the oracle models among 200
simulations for the first setting (S1) with logistic regression examples with $\beta^* = (3, 4, \ldots)^T$ when
s = 12, 24, q = 15, 50, n = 600 and p = 2000. The triplets under each plot represent the
corresponding values of $(s, q, \rho)$ respectively.
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